Dynamic Multi-modal Prompting for Efficient Visual Grounding

Wu, Wansen; Liu, Ting; Wang, Youkai; Xu, Kai; Yin, Quanjun; Hu, Yue

doi:10.1007/978-981-99-8540-1_29

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14431)

Included in the following conference series:

Chinese Conference on Pattern Recognition and Computer Vision (PRCV)

Abstract

Prompt tuning has emerged as a flexible approach for adapting pre-trained models by solely learning additional inputs while keeping the model parameters frozen. However, simplistic prompts are insufficient to effectively address the challenges posed by complex multi-modal tasks such as visual grounding. In this paper, we propose a novel prompting architecture called Dynamic Multi-modAl Prompting (DMAP) for visual grounding. DMAP incorporates input-dependent prompting to tailor instance-level prompts for more accurate representation and dynamic multi-modal prompting to capture the relationship between the textual and visual inputs. To this end, we design a Dynamic Prompt Network (DPN) to generate multi-modal prompts based on the specific inputs, enhancing both adaptive prompt generation and multi-modal feature fusion. Extensive experimental results demonstrate the superiority of DMAP over competing methods in parameter-efficient settings. Furthermore, DMAP consistently outperforms state-of-the-art VG methods even when fine-tuning all parameters.
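The idea of input-dependent prompting can be illustrated with a minimal sketch. This is not the authors' DPN: the class `ToyPromptGenerator`, the mean-pooling step, the per-prompt linear projections, and all dimensions are hypothetical choices made only to show how prompts might be generated from both the text and the image and then prepended to the tokens that a frozen encoder would consume.

```python
import random


def mean_pool(tokens):
    """Average a list of equal-length feature vectors into one vector."""
    dim = len(tokens[0])
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]


def linear(weights, vec):
    """Multiply a (rows x cols) weight matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


class ToyPromptGenerator:
    """Hypothetical stand-in for an input-conditioned prompt generator.

    In a prompt-tuning setup, weights like these would be the trainable
    part, while the backbone that consumes the prompts stays frozen.
    """

    def __init__(self, dim, num_prompts, seed=0):
        rng = random.Random(seed)
        # One projection per prompt, applied to the joint [text ; visual] vector.
        self.weights = [
            [[rng.uniform(-0.1, 0.1) for _ in range(2 * dim)] for _ in range(dim)]
            for _ in range(num_prompts)
        ]

    def __call__(self, text_tokens, visual_tokens):
        # Condition on both modalities: pool each, then concatenate.
        joint = mean_pool(text_tokens) + mean_pool(visual_tokens)
        return [linear(w, joint) for w in self.weights]


def prepend_prompts(prompts, tokens):
    """Build the sequence a frozen encoder would receive: prompts, then tokens."""
    return prompts + tokens


dim = 4
text = [[0.1 * (i + j) for j in range(dim)] for i in range(3)]      # 3 text tokens
image_a = [[0.2 * (i - j) for j in range(dim)] for i in range(5)]   # 5 visual tokens
image_b = [[0.05 * (i * j) for j in range(dim)] for i in range(5)]  # a different image

generator = ToyPromptGenerator(dim=dim, num_prompts=2)
prompts_a = generator(text, image_a)
prompts_b = generator(text, image_b)
sequence_a = prepend_prompts(prompts_a, text)
```

Because the prompts are computed from the pooled features, the same expression paired with two different images yields two different prompt sets, which is the contrast with a single static prompt shared across all inputs.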

Acknowledgement

This research was supported in part by the National Natural Science Fund of China (Grant Nos. 62306329, 62103420, 62103425 and 62103428), the Natural Science Fund of Hunan Province (Grant Nos. 2021JJ40697, 2021JJ40702, 2022JJ40559 and 2023JJ40676), and the Hunan Provincial Innovation Foundation for Postgraduate.

Author information

Authors and Affiliations

College of Systems Engineering, National University of Defense Technology, Changsha, 410003, China
Wansen Wu, Ting Liu, Youkai Wang, Kai Xu, Quanjun Yin & Yue Hu

Corresponding author

Correspondence to Yue Hu.

Editor information

Editors and Affiliations

Nanjing University of Information Science and Technology, Nanjing, China
Qingshan Liu
Xiamen University, Xiamen, China
Hanzi Wang
Beijing University of Posts and Telecommunications, Beijing, China
Zhanyu Ma
Sun Yat-sen University, Guangzhou, China
Weishi Zheng
Peking University, Beijing, China
Hongbin Zha
Chinese Academy of Sciences, Beijing, China
Xilin Chen
Chinese Academy of Sciences, Beijing, China
Liang Wang
Xiamen University, Xiamen, China
Rongrong Ji

About this paper

Cite this paper

Wu, W., Liu, T., Wang, Y., Xu, K., Yin, Q., Hu, Y. (2024). Dynamic Multi-modal Prompting for Efficient Visual Grounding. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14431. Springer, Singapore. https://doi.org/10.1007/978-981-99-8540-1_29

DOI: https://doi.org/10.1007/978-981-99-8540-1_29
Published: 25 December 2023
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8539-5
Online ISBN: 978-981-99-8540-1
eBook Packages: Computer Science, Computer Science (R0)
