Abstract
As a challenging multi-modal task, image-text matching remains an attractive research topic. Its essence lies in narrowing the semantic gap between vision and language so the two modalities align better. Existing works focus on either coarse-grained alignment between global images and texts or fine-grained alignment between salient regions and words. However, they treat related and redundant pairs (i.e., regions with no matching words, or pairs of low relevance) indiscriminately. We therefore propose a Novel Clustering Aggregation and multi-grained Alignment network (NCAA), which uses cross-modal contextual clustering to group regions by semantic information consistent with the text content. Specifically, we take textual fragments as clustering centers and region-fragment similarity as the propagation medium, and devise two mask mechanisms so that related and redundant pairs are handled simultaneously yet distinguishably. Two alignment modules of different granularities are introduced to achieve multi-grained alignment. By incorporating both global and local similarity into the training and inference phases, our model attains further gains. Finally, extensive experiments on two benchmark datasets, Flickr30K and MSCOCO, demonstrate the efficacy of our framework.
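The core idea described above — textual fragments serving as cluster centers, region-fragment similarity acting as the propagation medium, and a mask separating related from redundant pairs — can be sketched as follows. This is a minimal illustration under our own assumptions (cosine similarity, a softmax over regions, a single relevance threshold `rel_thresh`), not the authors' actual NCAA implementation; all function and parameter names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cluster_regions_to_fragments(regions, fragments, tau=0.1, rel_thresh=0.2):
    """Group image-region features around textual-fragment centers.

    regions:   (R, D) array of image-region features
    fragments: (F, D) array of textual-fragment features (cluster centers)
    Returns one aggregated visual feature per fragment, plus a boolean
    mask marking which region-fragment pairs counted as related.
    """
    # cosine similarity between every region and every fragment center
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    f = fragments / np.linalg.norm(fragments, axis=1, keepdims=True)
    sim = r @ f.T                                    # (R, F)

    # mask mechanism: pairs below the relevance threshold are redundant
    related = sim >= rel_thresh                      # (R, F)
    weights = softmax(np.where(related, sim / tau, -np.inf), axis=0)
    weights = np.nan_to_num(weights)                 # fragments with no related region

    # aggregate regions into one visual representation per fragment center
    aggregated = weights.T @ regions                 # (F, D)
    return aggregated, related
```

In this sketch, each fragment attends only to regions whose similarity exceeds the threshold, which mirrors the paper's distinction between related and redundant pairs; the aggregated per-fragment features would then feed the fine-grained alignment module, while pooled global features feed the coarse-grained one.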
Acknowledgements
This work was supported by the National Natural Science Foundation of China (62020106012, 62202204), the National Key Research and Development Program of China (Grant No. 2023YFF1105102), and the 111 Project of the Ministry of Education of China (Grant No. B12018).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Zhang, S., Wu, X.J., Xu, T., Zhang, D. (2025). Novel Clustering Aggregation and Multi-grained Alignment for Image-Text Matching. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, C.L., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15321. Springer, Cham. https://doi.org/10.1007/978-3-031-78305-0_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-78304-3
Online ISBN: 978-3-031-78305-0