Abstract
Referring Expression Comprehension is a multimodal task that spans two fields: Computer Vision and Natural Language Processing. Given an image and a natural language expression, the task is to locate the image region that corresponds to the expression. This paper addresses two shortcomings of current methods: visual and textual features are not fused effectively in the multimodal alignment stage, and visual and textual information is not exploited effectively in the prediction stage. Two improvements are proposed: multimodal feature fusion and iterative reasoning based on a multimodal attention mechanism. In the multimodal feature fusion stage, three fusion modules combine visual and textual features from different perspectives to obtain rich cross-modal information; in the iterative reasoning stage, the visual and textual features are accessed several times to progressively refine the predicted target region. To verify the performance of the proposed method, extensive experiments were conducted on three public datasets.
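The following is a minimal sketch, not the authors' implementation, of the iterative-reasoning idea described above: a learnable target query repeatedly attends to the fused visual-textual tokens via cross-modal attention and is refined over several steps before a box is predicted. All module names, dimensions, and step counts are illustrative assumptions.

```python
# Hypothetical sketch of iterative reasoning over fused multimodal features.
import torch
import torch.nn as nn


class IterativeReasoningHead(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8, num_steps: int = 3):
        super().__init__()
        self.num_steps = num_steps
        # Cross-attention: the evolving target query attends to the fused tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.ReLU(), nn.Linear(dim * 4, dim))
        # Predict a normalized box (cx, cy, w, h) from the refined query.
        self.box_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))

    def forward(self, query: torch.Tensor, fused_tokens: torch.Tensor) -> torch.Tensor:
        # query:        (B, 1, D) learnable target query
        # fused_tokens: (B, N, D) visual + textual tokens after feature fusion
        for _ in range(self.num_steps):
            attended, _ = self.cross_attn(query, fused_tokens, fused_tokens)
            query = self.norm1(query + attended)          # residual update of the query
            query = self.norm2(query + self.ffn(query))   # position-wise refinement
        return self.box_head(query).sigmoid().squeeze(1)  # (B, 4), normalized box


if __name__ == "__main__":
    head = IterativeReasoningHead()
    q = torch.randn(2, 1, 256)          # one target query per image
    tokens = torch.randn(2, 440, 256)   # e.g. 400 visual + 40 textual tokens
    print(head(q, tokens).shape)        # torch.Size([2, 4])
```

Each pass over the fused features corresponds to one reasoning step; accessing the visual and textual information several times lets the predicted region be optimized gradually rather than in a single shot.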
Acknowledgment
This work is supported by the Inner Mongolia Science and Technology Project No. 2021GG0166.