GViG: Generative Visual Grounding Using Prompt-Based Language Modeling for Visual Question Answering

Li, Yi-Ting; Lin, Ying-Jia; Yeh, Chia-Jen; Lin, Chun-Yi; Kao, Hung-Yu

doi:10.1007/978-981-97-2266-2_7

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14650)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

The WSDM 2023 Toloka VQA challenge introduces a new Grounding-based Visual Question Answering (GVQA) dataset, elevating multimodal task complexity. This challenge diverges from traditional VQA by requiring models to identify a bounding box in response to an image-question pair, aligning with Visual Grounding tasks. Existing VG approaches, when applied to GVQA, often necessitate external data or larger models for satisfactory results, leading to high computational demands. We approach this as a language modeling problem, utilizing prompt tuning with multiple state-of-the-art VQA models. Our method, operating solely on an NVIDIA RTX3090 GPU without external data, secured third place in the challenge, achieving an Intersection over Union (IoU) of 75.658. Our model notably provides explainability between textual and visual data through its attention mechanism, offering insights into its decision-making process. This research demonstrates that high performance in GVQA can be achieved with minimal resources, enhancing understanding of model dynamics and paving the way for improved interpretability and efficiency. Our code is available here: https://github.com/IKMLab/GViG.git
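
The Intersection over Union metric mentioned in the abstract can be illustrated with a minimal sketch. It assumes boxes are given as (left, top, right, bottom) pixel coordinates and that a dataset-level score is the plain mean over image-question pairs. The names `box_iou` and `mean_iou` are hypothetical, and this is not the challenge's official scoring code.

```python
from typing import Sequence, Tuple

# Assumed box layout: (left, top, right, bottom) in pixels.
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of a box; degenerate boxes count as zero area."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def box_iou(pred: Box, target: Box) -> float:
    """Intersection over Union of two axis-aligned boxes, in [0, 1]."""
    # Width and height of the overlap rectangle, clamped at zero.
    inter_w = max(0.0, min(pred[2], target[2]) - max(pred[0], target[0]))
    inter_h = max(0.0, min(pred[3], target[3]) - max(pred[1], target[1]))
    inter = inter_w * inter_h
    union = box_area(pred) + box_area(target) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Box], targets: Sequence[Box]) -> float:
    """Average IoU over paired predictions and ground-truth boxes."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    return sum(box_iou(p, t) for p, t in zip(preds, targets)) / len(preds)


if __name__ == "__main__":
    # Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1).
    print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

The sketch returns values on a 0 to 1 scale; a leaderboard may report the same quantity multiplied by 100.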

Acknowledgement

This work was supported by the National Science and Technology Council, Taiwan, under Grant NSTC 112-2223-E-006-009.

Author information

Authors and Affiliations

Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan City, Taiwan
Yi-Ting Li, Ying-Jia Lin, Chia-Jen Yeh, Chun-Yi Lin & Hung-Yu Kao

Corresponding author

Correspondence to Hung-Yu Kao.

Editor information

Editors and Affiliations

Taipei, Taiwan
De-Nian Yang
Microsoft Research Asia, Beijing, China
Xing Xie
National Yang Ming Chiao Tung University, Hsinchu, Taiwan
Vincent S. Tseng
Duke University, Durham, NC, USA
Jian Pei
National Cheng Kung University, Tainan, Taiwan
Jen-Wei Huang
Silesian University of Technology, Gliwice, Poland
Jerry Chun-Wei Lin

About this paper

Cite this paper

Li, YT., Lin, YJ., Yeh, CJ., Lin, CY., Kao, HY. (2024). GViG: Generative Visual Grounding Using Prompt-Based Language Modeling for Visual Question Answering. In: Yang, DN., Xie, X., Tseng, V.S., Pei, J., Huang, JW., Lin, J.CW. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2024. Lecture Notes in Computer Science, vol 14650. Springer, Singapore. https://doi.org/10.1007/978-981-97-2266-2_7

DOI: https://doi.org/10.1007/978-981-97-2266-2_7
Published: 25 April 2024
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-2265-5
Online ISBN: 978-981-97-2266-2
eBook Packages: Computer Science, Computer Science (R0)
