Abstract
Benefiting from pretraining on large-scale multi-modal data, cross-modal pretrained models such as CLIP have shown excellent performance on text-to-image retrieval. However, current research focuses mainly on scenarios in which image and text are strongly matched, an assumption that does not always hold in practice. In social media content or daily communication, for example, the text is often only partially related to the image and may contain irrelevant material, which introduces non-negligible noise into text-to-image retrieval. This noisy multi-modal setting differs significantly from typical cross-modal pretraining corpora and can substantially degrade the retrieval performance of general image-text retrieval models. In this paper, we focus on the task of noisy text-to-image retrieval and propose an iterative retrieval framework that first extracts the key-semantic information from the noisy text via knowledge distillation, and then retrieves the relevant image from the image pool using this key-semantic clue. Experiments on the Noisy-MSCOCO and PhotoChat datasets confirm the superiority of the proposed iterative retrieval framework over general retrieval models on noisy text-to-image retrieval.
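To make the two-stage pipeline concrete, the sketch below illustrates the idea rather than the authors' implementation: the stage-one key-semantic extractor, which the paper trains with knowledge distillation, is replaced here by a naive placeholder, and stage two ranks a candidate image pool with the publicly available Hugging Face CLIP API. All names in the sketch (extract_key_text, retrieve) are illustrative and do not come from the paper.

# A minimal sketch of the two-stage retrieval pipeline described in the
# abstract, under the assumptions stated above; not the authors' code.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def extract_key_text(noisy_text: str) -> str:
    # Stage 1 placeholder: the paper distills a student model that extracts
    # only the image-relevant content from the noisy text; here we naively
    # keep the first sentence so the pipeline runs end to end.
    return noisy_text.split(".")[0]

@torch.no_grad()
def retrieve(noisy_text: str, image_pool: list[Image.Image], top_k: int = 5) -> list[int]:
    # Stage 2: embed the extracted key text and every candidate image with
    # CLIP, then rank images by cosine similarity to the text embedding.
    key_text = extract_key_text(noisy_text)
    text_inputs = processor(text=[key_text], return_tensors="pt",
                            padding=True, truncation=True)
    image_inputs = processor(images=image_pool, return_tensors="pt")
    t = model.get_text_features(**text_inputs)
    v = model.get_image_features(**image_inputs)
    t = t / t.norm(dim=-1, keepdim=True)   # L2-normalize text embedding
    v = v / v.norm(dim=-1, keepdim=True)   # L2-normalize image embeddings
    similarities = (t @ v.T).squeeze(0)    # cosine similarity per image
    return similarities.topk(min(top_k, len(image_pool))).indices.tolist()

In the full framework, the placeholder extractor would be replaced by the distilled student model the paper proposes; the CLIP-based second stage is what degrades when the raw noisy text is used directly.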
References
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Cui, Y., et al.: ROSITA: Enhancing vision-and-language semantic alignments via cross- and intra-modal knowledge integration. In: Proceedings of the 29th ACM International Conference on Multimedia, pp. 797–806 (2021)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Dosovitskiy, A., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
Ji, Z., Chen, K., Wang, H.: Step-wise hierarchical alignment network for image-text matching. arXiv preprint arXiv:2106.06509 (2021)
Ji, Z., Wang, H., Han, J., Pang, Y.: Saliency-guided attention network for image-sentence matching. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5754–5763 (2019)
Jia, C., et al.: Scaling up visual and vision-language representation learning with noisy text supervision. In: International Conference on Machine Learning, pp. 4904–4916. PMLR (2021)
Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Stat. 22(1), 79–86 (1951)
Lee, K.H., Chen, X., Hua, G., Hu, H., He, X.: Stacked cross attention for image-text matching. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 201–216 (2018)
Lin, T.-Y., et al.: Microsoft COCO: Common Objects in Context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Liu, Y.: Fine-tune BERT for extractive summarization. arXiv preprint arXiv:1903.10318 (2019)
Mihalcea, R., Tarau, P.: TextRank: Bringing order into text. In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pp. 404–411 (2004)
Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al.: Improving language understanding by generative pre-training (2018)
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019)
Wu, H., et al.: Unified visual-semantic embeddings: Bridging vision and language with structured meaning representations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6609–6618 (2019)
Wu, Y., Wang, S., Song, G., Huang, Q.: Learning fragment self-attention embeddings for image-text matching. In: Proceedings of the 27th ACM International Conference on Multimedia, pp. 2088–2096 (2019)
Zang, X., Liu, L., Wang, M., Song, Y., Zhang, H., Chen, J.: PhotoChat: A human-human dialogue dataset with photo sharing behavior for joint image-text modeling. arXiv preprint arXiv:2108.01453 (2021)
Acknowledgments
This research was supported by the National Key Research and Development Program of China (Grant No. 2022YFB3103100), the National Natural Science Foundation of China (Grant No. 62276245), and Anhui Provincial Natural Science Foundation (Grant No. 2008085J31).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, Z., Zhu, Y., Gao, Z., Sheng, X., Xu, L. (2023). ItrievalKD: An Iterative Retrieval Framework Assisted with Knowledge Distillation for Noisy Text-to-Image Retrieval. In: Kashima, H., Ide, T., Peng, WC. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2023. Lecture Notes in Computer Science, vol. 13937. Springer, Cham. https://doi.org/10.1007/978-3-031-33380-4_20
DOI: https://doi.org/10.1007/978-3-031-33380-4_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-33379-8
Online ISBN: 978-3-031-33380-4
eBook Packages: Computer Science, Computer Science (R0)