Abstract
Image-text matching is a rapidly evolving task in multimodal learning that aims to measure the semantic similarity between images and texts. Despite significant progress in recent years, most existing methods rely on static image-text interaction schemes, overlooking the substantial variation in scene complexity across samples. In practice, the multimodal interaction strategy should be flexibly adjusted to the scene complexity of each input; for instance, excessive multimodal interaction may introduce noise when processing simple samples. In this paper, we propose a novel Structure-aware Adaptive Hybrid Interaction Modeling (SAHIM) network that adaptively adjusts the image-text interaction strategy according to the input. Moreover, we design a Multimodal Graph Inference (MGI) module to explore latent structural connections between global and local features, and an Entity Attention Enhancement (EAE) module to filter out irrelevant local segments. Finally, we align the image and text features with a bidirectional triplet loss. To validate the proposed SAHIM model, we conduct comprehensive experiments on Flickr30K and MSCOCO. Experimental results show that SAHIM outperforms state-of-the-art methods on both datasets, demonstrating the superiority of our model.
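For concreteness, the sketch below shows a common PyTorch formulation of the bidirectional triplet ranking loss used to align image and text embeddings in matching models of this kind. The abstract does not specify SAHIM's hyperparameters, so the margin value (0.2), the hardest-negative mining, the cosine-similarity scoring, and the class name `BidirectionalTripletLoss` are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BidirectionalTripletLoss(nn.Module):
    """Hinge-based triplet ranking loss over a batch of matched image-text
    pairs, applied in both retrieval directions (image->text and text->image),
    with hardest-negative mining within the batch."""

    def __init__(self, margin: float = 0.2):
        super().__init__()
        self.margin = margin

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        # img_emb, txt_emb: (batch, dim); assumed L2-normalized, so the dot
        # product below is cosine similarity.
        scores = img_emb @ txt_emb.t()           # (batch, batch) similarity matrix
        pos = scores.diag().view(-1, 1)          # similarity of each matched pair

        # Hinge costs against all in-batch negatives, in both directions.
        cost_i2t = (self.margin - pos + scores).clamp(min=0)      # image as anchor
        cost_t2i = (self.margin - pos.t() + scores).clamp(min=0)  # text as anchor

        # Zero out the matched (positive) pairs on the diagonal.
        mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
        cost_i2t = cost_i2t.masked_fill(mask, 0)
        cost_t2i = cost_t2i.masked_fill(mask, 0)

        # Keep only the hardest negative per anchor, then average over the batch.
        return cost_i2t.max(dim=1)[0].mean() + cost_t2i.max(dim=0)[0].mean()


# Minimal usage example with random, L2-normalized embeddings.
img = F.normalize(torch.randn(8, 256), dim=1)
txt = F.normalize(torch.randn(8, 256), dim=1)
loss = BidirectionalTripletLoss(margin=0.2)(img, txt)
```

Mining the hardest negative per anchor (rather than summing over all negatives) is a standard choice in this line of work, as it concentrates the gradient on the most confusable in-batch sample in each direction.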
Acknowledgements
This work was supported by the Major Program of the National Natural Science Foundation of China (No. 61991410), the Natural Science Foundation of Shanghai (No. 23ZR1422800), and the Program of the Pujiang National Laboratory (No. P22KN00391).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, W., Wang, J., Wang, C., Peng, Y., Xie, S. (2024). Structure-Aware Adaptive Hybrid Interaction Modeling for Image-Text Matching. In: Rudinac, S., et al. (eds.) MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol. 14554. Springer, Cham. https://doi.org/10.1007/978-3-031-53305-1_25
DOI: https://doi.org/10.1007/978-3-031-53305-1_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-53304-4
Online ISBN: 978-3-031-53305-1
eBook Packages: Computer Science (R0)