
Structure-Aware Adaptive Hybrid Interaction Modeling for Image-Text Matching

  • Conference paper
MultiMedia Modeling (MMM 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14554)


Abstract

Image-text matching is a rapidly evolving task in multimodal learning that aims to measure the similarity between images and texts. Despite significant progress in recent years, most existing methods rely on static image-text interaction schemes, overlooking the substantial variation in scene complexity across samples. In practice, the multimodal interaction strategy should be flexibly adjusted to the scene complexity of each input; excessive multimodal interaction, for instance, may introduce noise when dealing with simple samples. In this paper, we propose a novel Structure-aware Adaptive Hybrid Interaction Modeling (SAHIM) network, which adaptively adjusts the image-text interaction strategy according to the input. Moreover, we design a Multimodal Graph Inference (MGI) module to explore latent structural connections between global and local features, and an Entity Attention Enhancement (EAE) module to filter out irrelevant local segments. Finally, we align the image and text features with a bidirectional triplet loss. To validate the proposed SAHIM model, we conduct comprehensive experiments on Flickr30K and MSCOCO. Experimental results show that SAHIM outperforms state-of-the-art methods on both datasets, demonstrating the superiority of our model.
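
The bidirectional triplet loss named in the abstract is, in the image-text matching literature, most commonly the hinge-based ranking loss popularized by VSE++ and SCAN, which pushes each matched image-text pair above its negatives by a fixed margin in both retrieval directions. The sketch below is a minimal PyTorch rendering of that standard formulation under stated assumptions: the margin value and the hardest-negative mining are common conventions, not details taken from this page, and SAHIM's adaptive interaction, MGI, and EAE modules are not reproduced here.

    # Minimal sketch of a standard bidirectional triplet ranking loss
    # for image-text matching (hinge form; the margin and the
    # hardest-negative mining are assumed conventions, not SAHIM's
    # published settings).
    import torch

    def bidirectional_triplet_loss(sim: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
        # sim: (N, N) batch similarity matrix; sim[i, j] scores image i
        # against text j, so the diagonal holds the matched pairs.
        n = sim.size(0)
        pos = sim.diag().view(n, 1)
        cost_txt = (margin + sim - pos).clamp(min=0)      # image -> negative texts
        cost_img = (margin + sim - pos.t()).clamp(min=0)  # text -> negative images
        mask = torch.eye(n, dtype=torch.bool, device=sim.device)
        cost_txt = cost_txt.masked_fill(mask, 0)
        cost_img = cost_img.masked_fill(mask, 0)
        # Hardest negative per query in each direction; summing over all
        # negatives is the common alternative.
        return cost_txt.max(dim=1)[0].mean() + cost_img.max(dim=0)[0].mean()

Ranking in both directions is what makes the loss bidirectional: each image must outscore its hardest negative text, and each text its hardest negative image, by the margin.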



Acknowledgements

This work was supported by the Major Program of the National Natural Science Foundation of China (No. 61991410), the Natural Science Foundation of Shanghai (No. 23ZR1422800), and the Program of the Pujiang National Laboratory (No. P22KN00391).

Author information


Corresponding author

Correspondence to Chao Wang.



Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Liu, W., Wang, J., Wang, C., Peng, Y., Xie, S. (2024). Structure-Aware Adaptive Hybrid Interaction Modeling for Image-Text Matching. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14554. Springer, Cham. https://doi.org/10.1007/978-3-031-53305-1_25


  • DOI: https://doi.org/10.1007/978-3-031-53305-1_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-53304-4

  • Online ISBN: 978-3-031-53305-1

  • eBook Packages: Computer Science; Computer Science (R0)
