
Hierarchical cross-modal contextual attention network for visual grounding

  • Regular Paper

Abstract

This paper explores the task of visual grounding (VG), which aims to localize the image region referred to by a sentence query. VG has advanced significantly with Transformer-based frameworks, which capture image and text context without region proposals. However, previous research has rarely explored hierarchical semantics or the cross-interactions between the two uni-modal encoders. This paper therefore proposes a Hierarchical Cross-modal Contextual Attention Network (HCCAN) for the VG task. HCCAN combines a visual-guided text contextual attention module, a text-guided visual contextual attention module, and a Transformer-based multi-modal feature fusion module. This design not only captures intra-modality and inter-modality relationships through self-attention mechanisms but also models the hierarchical semantics of textual and visual content in a common space. Experiments on four standard benchmarks, Flickr30K Entities, RefCOCO, RefCOCO+, and RefCOCOg, demonstrate the effectiveness of the proposed method. The code is publicly available at https://www.github.com/cutexin66/HCCAN.
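To make the kind of architecture described above concrete, the following is a minimal PyTorch sketch of guided cross-modal attention with a Transformer-based fusion stage. It illustrates the general technique only; the module names, feature dimensions, pooling, and box head are assumptions for illustration and do not reproduce the authors' HCCAN implementation.

```python
# Illustrative sketch only: guided cross-modal attention plus Transformer fusion,
# loosely following the abstract (visual-guided text attention, text-guided visual
# attention, multi-modal fusion). Not the authors' HCCAN code.
import torch
import torch.nn as nn


class CrossModalContextualAttention(nn.Module):
    """One direction of guided attention: queries come from one modality,
    keys/values from the other, with residual connections and a feed-forward."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, query_feats, guide_feats):
        # Intra-modality context via self-attention.
        x, _ = self.self_attn(query_feats, query_feats, query_feats)
        x = self.norm1(query_feats + x)
        # Inter-modality context: the other modality guides the attention.
        y, _ = self.cross_attn(x, guide_feats, guide_feats)
        y = self.norm2(x + y)
        return self.norm3(y + self.ffn(y))


class ToyGroundingHead(nn.Module):
    """Fuses both guided streams with a small Transformer encoder and regresses
    one (cx, cy, w, h) box in [0, 1], as in one-stage visual grounding."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.vis_guided_text = CrossModalContextualAttention(dim)
        self.txt_guided_visual = CrossModalContextualAttention(dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.box_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))

    def forward(self, visual_tokens, text_tokens):
        t = self.vis_guided_text(text_tokens, visual_tokens)    # text attended by vision
        v = self.txt_guided_visual(visual_tokens, text_tokens)  # vision attended by text
        fused = self.fusion(torch.cat([v, t], dim=1))
        return self.box_head(fused.mean(dim=1)).sigmoid()


if __name__ == "__main__":
    vis = torch.randn(2, 400, 256)  # e.g. 20x20 visual feature-map tokens
    txt = torch.randn(2, 20, 256)   # e.g. BERT token embeddings projected to 256-d
    print(ToyGroundingHead()(vis, txt).shape)  # torch.Size([2, 4])
```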


Data availability

The datasets analyzed during the current study are available from https://github.com/BryanPlummer/flickr30k_entities (Flickr30K Entities), http://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco.zip (RefCOCO), http://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco+.zip (RefCOCO+), and http://bvisionweb1.cs.unc.edu/licheng/referit/data/refcocog.zip (RefCOCOg).
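The three RefCOCO-family archives listed above can be fetched and unpacked directly; the helper below is a small, hypothetical convenience script (the output directory layout and the use of urllib/zipfile are assumptions, and Flickr30K Entities is obtained from its GitHub repository instead).

```python
# Sketch of a download helper for the referring-expression archives listed above.
import urllib.request
import zipfile
from pathlib import Path

URLS = {
    "refcoco": "http://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco.zip",
    "refcoco+": "http://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco+.zip",
    "refcocog": "http://bvisionweb1.cs.unc.edu/licheng/referit/data/refcocog.zip",
}


def fetch(name: str, out_dir: str = "data") -> Path:
    """Download one archive (if not already present) and extract it under out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    archive = out / f"{name}.zip"
    if not archive.exists():
        urllib.request.urlretrieve(URLS[name], str(archive))
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(out)  # extracted folder names follow the archive contents
    return out


if __name__ == "__main__":
    print(fetch("refcoco"))
```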

Notes

  1. https://github.com/Lyken17/pytorch-OpCounter.
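The footnoted pytorch-OpCounter package (thop) is commonly used to report model complexity. A hedged usage sketch follows; the torchvision model and input shape are placeholders, not the network evaluated in the paper.

```python
# Sketch of counting MACs and parameters with pytorch-OpCounter (pip install thop).
import torch
import torchvision.models as models
from thop import profile

model = models.resnet50()                 # placeholder model
dummy = torch.randn(1, 3, 224, 224)       # placeholder input
macs, params = profile(model, inputs=(dummy,))
print(f"MACs: {macs / 1e9:.2f} G, Params: {params / 1e6:.2f} M")
```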


Acknowledgements

This work was supported in part by the University Synergy Innovation Program of Anhui Province (No. GXXT-2022-043, No. GXXT-2022-037), Anhui Provincial Key Research and Development Program (No. 2022a05020042), Anhui International Joint Research Center for Ancient Architecture Intellisencing and Multi-Dimensional Modeling (No. GJZZX2021KF01) and National Natural Science Foundation of China (No. 61902104, 62105002).

Author information

Authors and Affiliations

Authors

Contributions

XX and GL: writing of the main manuscript text. YS: resources, supervision. WN: software, data curation. YH: visualization. FN: conceptualization, methodology, funding acquisition, writing: review and editing.

Corresponding author

Correspondence to Fudong Nian.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Xu, X., Lv, G., Sun, Y. et al. Hierarchical cross-modal contextual attention network for visual grounding. Multimedia Systems 29, 2073–2083 (2023). https://doi.org/10.1007/s00530-023-01097-8

