
Novel Clustering Aggregation and Multi-grained Alignment for Image-Text Matching

  • Conference paper
Pattern Recognition (ICPR 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15321)


Abstract

As a challenging multi-modal task, image-text matching remains an active topic of research. Its essence lies in narrowing the semantic gap between vision and language so that the two modalities align better. Existing works focus either on coarse-grained alignment between global images and texts or on fine-grained alignment between salient regions and words. However, they do not distinguish related pairs from redundant ones (i.e., regions with no matching words or pairs of low relevance). We therefore propose a Novel Clustering Aggregation and multi-grained Alignment network (NCAA), which uses cross-modal contextual clustering to group regions according to semantics consistent with the text content. Specifically, we take textual fragments as clustering centers, use region-fragment similarity as the propagation medium, and devise two mask mechanisms that treat related and redundant pairs simultaneously yet distinguishably. Two alignment modules of different granularities are also introduced to achieve multi-grained alignment. By incorporating both global and local similarity into the training and inference phases, the model attains further gains. Finally, extensive experiments on two benchmark datasets, Flickr30K and MSCOCO, demonstrate the efficacy of our framework.
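The clustering aggregation the abstract describes can be illustrated with a minimal NumPy sketch. Everything below (the function name, the similarity threshold, the softmax temperature, and the exact mask and pooling rules) is a hypothetical illustration of text-anchored clustering with related/redundant masking, not the paper's actual implementation.

```python
import numpy as np

def ncaa_cluster_sketch(regions, fragments, tau=0.1, thresh=0.3):
    """Hypothetical sketch: aggregate image regions around textual fragments.

    regions:   (R, d) array of region features.
    fragments: (F, d) array of textual-fragment features (cluster centers).
    """
    # L2-normalize so the dot product below is cosine similarity.
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    f = fragments / np.linalg.norm(fragments, axis=1, keepdims=True)
    sim = r @ f.T                           # region-fragment similarity, (R, F)

    related = sim >= thresh                 # mask of related region-fragment pairs
    redundant = ~related.any(axis=1)        # regions matching no fragment at all

    # Soft assignment restricted to related pairs; redundant pairs are zeroed out.
    attn = np.exp(sim / tau) * related
    attn = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
    clusters = attn.T @ regions             # fragment-anchored aggregates, (F, d)

    # Local similarity: mean over related pairs; global: cosine of pooled features.
    local = float(sim[related].mean()) if related.any() else 0.0
    g, t = r.mean(axis=0), f.mean(axis=0)
    global_sim = float(g @ t / (np.linalg.norm(g) * np.linalg.norm(t) + 1e-8))
    return clusters, redundant, local, global_sim
```

The two boolean masks (`related` and its complement driving `redundant`) stand in for the abstract's "two mask mechanisms": one routes attention only through relevant pairs, while the other flags regions that match no word and should not pollute the aggregation.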



Acknowledgements

This work was supported by the National Natural Science Foundation of China (62020106012, 62202204), the National Key Research and Development Program of China (Grant No. 2023YFF1105102), and the 111 Project of the Ministry of Education of China (Grant No. B12018).

Author information

Corresponding author

Correspondence to Xiao-jun Wu.


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Zhang, S., Wu, Xj., Xu, T., Zhang, D. (2025). Novel Clustering Aggregation and Multi-grained Alignment for Image-Text Matching. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15321. Springer, Cham. https://doi.org/10.1007/978-3-031-78305-0_5


  • DOI: https://doi.org/10.1007/978-3-031-78305-0_5


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-78304-3

  • Online ISBN: 978-3-031-78305-0

  • eBook Packages: Computer Science, Computer Science (R0)
