Abstract
Image-text matching is a rapidly evolving task in multimodal learning that aims to measure the semantic similarity between images and texts. Despite significant progress in recent years, most existing methods rely on static image-text interaction schemes, overlooking the substantial variation in scene complexity across samples. In practice, the multimodal interaction strategy should be flexibly adjusted to the scene complexity of each input; for instance, excessive multimodal interaction may introduce noise when processing simple samples. In this paper, we propose a novel Structure-aware Adaptive Hybrid Interaction Modeling (SAHIM) network that adaptively adjusts the image-text interaction strategy according to the input. Moreover, we design a Multimodal Graph Inference (MGI) module to explore latent structural connections between global and local features, and an Entity Attention Enhancement (EAE) module to filter out irrelevant local segments. Finally, we align the image and text features with a bidirectional triplet loss. To validate the proposed SAHIM model, we conduct comprehensive experiments on Flickr30K and MSCOCO. Experimental results show that SAHIM outperforms state-of-the-art methods on both datasets, demonstrating the superiority of our model.
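For concreteness, the sketch below shows a common PyTorch formulation of the bidirectional triplet ranking loss used to align image and text embeddings in matching models of this kind. The abstract does not specify SAHIM's hyperparameters, so the margin value (0.2), the hardest-negative mining, the cosine-similarity scoring, and the class name `BidirectionalTripletLoss` are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BidirectionalTripletLoss(nn.Module):
    """Hinge-based triplet ranking loss over a batch of matched image-text
    pairs, applied in both retrieval directions (image->text and text->image),
    with hardest-negative mining within the batch."""

    def __init__(self, margin: float = 0.2):
        super().__init__()
        self.margin = margin

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        # img_emb, txt_emb: (batch, dim); assumed L2-normalized, so the dot
        # product below is cosine similarity.
        scores = img_emb @ txt_emb.t()           # (batch, batch) similarity matrix
        pos = scores.diag().view(-1, 1)          # similarity of each matched pair

        # Hinge costs against all in-batch negatives, in both directions.
        cost_i2t = (self.margin - pos + scores).clamp(min=0)      # image as anchor
        cost_t2i = (self.margin - pos.t() + scores).clamp(min=0)  # text as anchor

        # Zero out the matched (positive) pairs on the diagonal.
        mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
        cost_i2t = cost_i2t.masked_fill(mask, 0)
        cost_t2i = cost_t2i.masked_fill(mask, 0)

        # Keep only the hardest negative per anchor, then average over the batch.
        return cost_i2t.max(dim=1)[0].mean() + cost_t2i.max(dim=0)[0].mean()


# Minimal usage example with random, L2-normalized embeddings.
img = F.normalize(torch.randn(8, 256), dim=1)
txt = F.normalize(torch.randn(8, 256), dim=1)
loss = BidirectionalTripletLoss(margin=0.2)(img, txt)
```

Mining the hardest negative per anchor (rather than summing over all negatives) is a standard choice in this line of work, as it concentrates the gradient on the most confusable in-batch sample in each direction.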
Acknowledgements
This work was supported by the Major Program of the National Natural Science Foundation of China (No. 61991410), the Natural Science Foundation of Shanghai (No. 23ZR1422800), and the Program of the Pujiang National Laboratory (No. P22KN00391).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, W., Wang, J., Wang, C., Peng, Y., Xie, S. (2024). Structure-Aware Adaptive Hybrid Interaction Modeling for Image-Text Matching. In: Rudinac, S., et al. (eds.) MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol. 14554. Springer, Cham. https://doi.org/10.1007/978-3-031-53305-1_25
DOI: https://doi.org/10.1007/978-3-031-53305-1_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-53304-4
Online ISBN: 978-3-031-53305-1
eBook Packages: Computer Science (R0)