Location Attention Knowledge Embedding Model for Image-Text Matching

Xu, Guoqing; Hu, Min; Wang, Xiaohua; Yang, Jiaoyun; Li, Nan; Zhang, Qingyu

doi:10.1007/978-981-99-8429-9_33

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14425)

Included in the following conference series:

Chinese Conference on Pattern Recognition and Computer Vision (PRCV)

Abstract

Image-text matching is the core algorithm of cross-modal retrieval and plays a central role in connecting vision and text. Because of the well-known semantic gap between the visual and textual modalities, image-text matching remains a challenging task. To reduce this semantic gap, existing methods use consensus knowledge for image-text matching. However, the consensus knowledge is extracted only from the co-occurrence frequency of words in the sentences of a corpus and does not consider the semantic information contained in the image, which degrades semantic matching performance. To solve this issue, we propose a Location Attention Knowledge Embedding (LAKE) model that improves the utilization of consensus knowledge by inferring the location of objects in an image. Specifically, our model consists of three parts. First, we design a location feature extraction (LFE) module, which divides the image into blocks, uses location attention to generate valuable location features, and then concatenates the location features with the extracted regional image features to obtain image features containing location information. At the same time, text features are extracted using the BERT model. Second, we use a knowledge representation module to extract the consensus knowledge features. Finally, the similarity between the image and the text is calculated from the knowledge fusion features to complete the matching process. Quantitative and qualitative results on the public datasets Flickr30k and MSCOCO demonstrate the effectiveness of the method.
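
The abstract describes the LFE module only at a high level, so the following is a minimal NumPy sketch of one way such a step could look, not the authors' implementation. It assumes a uniform grid of blocks, attention weights derived from the overlap between each region box and each block, a hypothetical block embedding table `block_emb`, and a softmax temperature `tau`; none of these details come from the paper.

```python
import numpy as np

def block_overlap(boxes, image_size, grid=4):
    """Fraction of each region box that lies inside each grid block.

    boxes: (n, 4) array of [x1, y1, x2, y2] in pixels (assumed inside the image).
    image_size: (width, height). Returns an (n, grid * grid) array.
    """
    width, height = image_size
    xs = np.linspace(0.0, width, grid + 1)   # vertical block boundaries
    ys = np.linspace(0.0, height, grid + 1)  # horizontal block boundaries
    overlap = np.zeros((boxes.shape[0], grid * grid))
    for r in range(grid):
        for c in range(grid):
            # Width and height of the intersection, clipped at zero.
            ix = np.clip(np.minimum(boxes[:, 2], xs[c + 1])
                         - np.maximum(boxes[:, 0], xs[c]), 0.0, None)
            iy = np.clip(np.minimum(boxes[:, 3], ys[r + 1])
                         - np.maximum(boxes[:, 1], ys[r]), 0.0, None)
            overlap[:, r * grid + c] = ix * iy
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return overlap / np.maximum(area, 1e-8)[:, None]

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def location_features(boxes, image_size, block_emb, grid=4, tau=0.1):
    """Attend over block embeddings, weighted by box/block overlap.

    block_emb: (grid * grid, d_loc) hypothetical embedding table (learned in practice).
    tau: assumed softmax temperature; smaller values sharpen the attention.
    """
    attn = softmax(block_overlap(boxes, image_size, grid) / tau)
    return attn @ block_emb

def fuse(region_feats, loc_feats):
    """Concatenate region features with location features along the feature axis."""
    return np.concatenate([region_feats, loc_feats], axis=1)
```

In this sketch a region that covers the whole image attends uniformly to all blocks, while a region confined to one block concentrates most of its attention there. The fused output has dimension `d + d_loc` per region and would be passed on to the later knowledge fusion and similarity steps.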

Acknowledgement

This work was supported in part by the National Natural Science Foundation of China under Grants 62176084 and 62176083, and in part by the Fundamental Research Funds for the Central Universities of China under Grants PA2022GDSK0068 and PA2022GDSK0066.

Author information

Authors and Affiliations

Anhui Province Key Laboratory of Affective Computing and Advanced Intelligent Machine, National Smart Eldercare International Science and Technology Cooperation Base, Hefei University of Technology, Hefei, 230602, China
Guoqing Xu, Min Hu, Xiaohua Wang, Jiaoyun Yang & Qingyu Zhang
School of Mental Health and Psychological Sciences, Anhui Medical University, 81 Meishan Road, Shushan District, Hefei, 230032, Anhui, China
Nan Li

Corresponding author

Correspondence to Guoqing Xu.

Editor information

Editors and Affiliations

Nanjing University of Information Science and Technology, Nanjing, China
Qingshan Liu
Xiamen University, Xiamen, China
Hanzi Wang
Beijing University of Posts and Telecommunications, Beijing, China
Zhanyu Ma
Sun Yat-sen University, Guangzhou, China
Weishi Zheng
Peking University, Beijing, China
Hongbin Zha
Chinese Academy of Sciences, Beijing, China
Xilin Chen
Chinese Academy of Sciences, Beijing, China
Liang Wang
Xiamen University, Xiamen, China
Rongrong Ji

Cite this paper

Xu, G., Hu, M., Wang, X., Yang, J., Li, N., Zhang, Q. (2024). Location Attention Knowledge Embedding Model for Image-Text Matching. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14425. Springer, Singapore. https://doi.org/10.1007/978-981-99-8429-9_33

DOI: https://doi.org/10.1007/978-981-99-8429-9_33
Published: 24 December 2023
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8428-2
Online ISBN: 978-981-99-8429-9
eBook Packages: Computer Science, Computer Science (R0)
