
Hierarchical cross-modal contextual attention network for visual grounding

  • Regular Paper

Abstract

This paper explores the task of visual grounding (VG), which aims to localize the image region referred to by a sentence query. VG has advanced significantly with Transformer-based frameworks, which capture image and text context without region proposals. However, previous research has rarely explored hierarchical semantics or the cross-interactions between the two uni-modal encoders. This paper therefore proposes a Hierarchical Cross-modal Contextual Attention Network (HCCAN) for the VG task. HCCAN combines a visual-guided text contextual attention module, a text-guided visual contextual attention module, and a Transformer-based multi-modal feature fusion module. This design not only captures intra-modality and inter-modality relationships through self-attention mechanisms but also models the hierarchical semantics of textual and visual content in a common space. Experiments on four standard benchmarks, Flickr30K Entities, RefCOCO, RefCOCO+, and RefCOCOg, demonstrate the effectiveness of the proposed method. The code is publicly available at https://www.github.com/cutexin66/HCCAN.
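To make the kind of architecture described above concrete, the following is a minimal PyTorch sketch of guided cross-modal attention with a Transformer-based fusion stage. It illustrates the general technique only; the module names, feature dimensions, pooling, and box head are assumptions for illustration and do not reproduce the authors' HCCAN implementation.

```python
# Illustrative sketch only: guided cross-modal attention plus Transformer fusion,
# loosely following the abstract (visual-guided text attention, text-guided visual
# attention, multi-modal fusion). Not the authors' HCCAN code.
import torch
import torch.nn as nn


class CrossModalContextualAttention(nn.Module):
    """One direction of guided attention: queries come from one modality,
    keys/values from the other, with residual connections and a feed-forward."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, query_feats, guide_feats):
        # Intra-modality context via self-attention.
        x, _ = self.self_attn(query_feats, query_feats, query_feats)
        x = self.norm1(query_feats + x)
        # Inter-modality context: the other modality guides the attention.
        y, _ = self.cross_attn(x, guide_feats, guide_feats)
        y = self.norm2(x + y)
        return self.norm3(y + self.ffn(y))


class ToyGroundingHead(nn.Module):
    """Fuses both guided streams with a small Transformer encoder and regresses
    one (cx, cy, w, h) box in [0, 1], as in one-stage visual grounding."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.vis_guided_text = CrossModalContextualAttention(dim)
        self.txt_guided_visual = CrossModalContextualAttention(dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.box_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))

    def forward(self, visual_tokens, text_tokens):
        t = self.vis_guided_text(text_tokens, visual_tokens)    # text attended by vision
        v = self.txt_guided_visual(visual_tokens, text_tokens)  # vision attended by text
        fused = self.fusion(torch.cat([v, t], dim=1))
        return self.box_head(fused.mean(dim=1)).sigmoid()


if __name__ == "__main__":
    vis = torch.randn(2, 400, 256)  # e.g. 20x20 visual feature-map tokens
    txt = torch.randn(2, 20, 256)   # e.g. BERT token embeddings projected to 256-d
    print(ToyGroundingHead()(vis, txt).shape)  # torch.Size([2, 4])
```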


Data availability

The datasets analyzed during the current study are available from https://github.com/BryanPlummer/flickr30k_entities (Flickr30K Entities), http://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco.zip (RefCOCO), http://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco+.zip (RefCOCO+), and http://bvisionweb1.cs.unc.edu/licheng/referit/data/refcocog.zip (RefCOCOg).
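The three RefCOCO-family archives listed above can be fetched and unpacked directly; the helper below is a small, hypothetical convenience script (the output directory layout and the use of urllib/zipfile are assumptions, and Flickr30K Entities is obtained from its GitHub repository instead).

```python
# Sketch of a download helper for the referring-expression archives listed above.
import urllib.request
import zipfile
from pathlib import Path

URLS = {
    "refcoco": "http://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco.zip",
    "refcoco+": "http://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco+.zip",
    "refcocog": "http://bvisionweb1.cs.unc.edu/licheng/referit/data/refcocog.zip",
}


def fetch(name: str, out_dir: str = "data") -> Path:
    """Download one archive (if not already present) and extract it under out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    archive = out / f"{name}.zip"
    if not archive.exists():
        urllib.request.urlretrieve(URLS[name], str(archive))
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(out)  # extracted folder names follow the archive contents
    return out


if __name__ == "__main__":
    print(fetch("refcoco"))
```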

Notes

  1. https://github.com/Lyken17/pytorch-OpCounter.
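The footnoted pytorch-OpCounter package (thop) is commonly used to report model complexity. A hedged usage sketch follows; the torchvision model and input shape are placeholders, not the network evaluated in the paper.

```python
# Sketch of counting MACs and parameters with pytorch-OpCounter (pip install thop).
import torch
import torchvision.models as models
from thop import profile

model = models.resnet50()                 # placeholder model
dummy = torch.randn(1, 3, 224, 224)       # placeholder input
macs, params = profile(model, inputs=(dummy,))
print(f"MACs: {macs / 1e9:.2f} G, Params: {params / 1e6:.2f} M")
```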


Acknowledgements

This work was supported in part by the University Synergy Innovation Program of Anhui Province (No. GXXT-2022-043, No. GXXT-2022-037), Anhui Provincial Key Research and Development Program (No. 2022a05020042), Anhui International Joint Research Center for Ancient Architecture Intellisencing and Multi-Dimensional Modeling (No. GJZZX2021KF01) and National Natural Science Foundation of China (No. 61902104, 62105002).

Author information

Authors and Affiliations

Authors

Contributions

XX and GL: writing of the main manuscript text. YS: resources, supervision. WN: software, data curation. YH: visualization. FN: conceptualization, methodology, funding acquisition, writing: review and editing.

Corresponding author

Correspondence to Fudong Nian.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Xu, X., Lv, G., Sun, Y. et al. Hierarchical cross-modal contextual attention network for visual grounding. Multimedia Systems 29, 2073–2083 (2023). https://doi.org/10.1007/s00530-023-01097-8

