TSCMR: Two-Stage Cross-Modal Retrieval

Chen, Zhihao; Wang, Hongya

doi:10.1007/978-3-031-46674-8_39

Zhihao Chen^15,17 &
Hongya Wang^15,16,17

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14179)

Included in the following conference series:

International Conference on Advanced Data Mining and Applications

Abstract

Large-scale vision and language models have significantly improved performance on cross-modal retrieval tasks. However, these models require substantial computing resources, which makes them difficult to run on resource-limited devices. It is therefore important to reduce model size and computing cost without sacrificing performance. In this paper, we improve TERAN by dividing cross-modal retrieval into two stages: coarse-grained image-text matching and fine-grained image-text matching. Specifically, we present a novel approach called the Two-Stage Cross-Modal Retrieval network (TSCMR). To reduce model size after training, our approach uses a new knowledge distillation method for Transformer-based models. Experiments show that our approach maintains performance comparable to TERAN on the MS-COCO 1K test set while being 2x smaller and 3.1x faster at inference.
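
The coarse-then-fine idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the mean-pooled global embedding used for the coarse stage, the max-over-regions, sum-over-words score used for the fine stage, and the `k_coarse` parameter are all assumptions chosen for illustration. The sketch only shows why a cheap first pass reduces the number of expensive fine-grained comparisons.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_pool(vectors):
    """Average a list of vectors into one global vector (assumed coarse representation)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def coarse_score(query_words, image_regions):
    """Stage 1 (hypothetical): one cosine similarity between pooled global vectors."""
    return cosine(mean_pool(query_words), mean_pool(image_regions))


def fine_score(query_words, image_regions):
    """Stage 2 (hypothetical): for each word take its best-matching region, then sum."""
    return sum(max(cosine(w, r) for r in image_regions) for w in query_words)


def two_stage_retrieve(query_words, gallery, k_coarse=2):
    """Rank images for a text query.

    gallery maps an image id to a list of region feature vectors.
    Only the k_coarse best images from stage 1 are re-scored in stage 2.
    """
    by_coarse = sorted(
        gallery,
        key=lambda image_id: coarse_score(query_words, gallery[image_id]),
        reverse=True,
    )
    candidates = by_coarse[:k_coarse]
    return sorted(
        candidates,
        key=lambda image_id: fine_score(query_words, gallery[image_id]),
        reverse=True,
    )
```

With N images in the gallery, the fine-grained score is computed only `k_coarse` times instead of N times, which is where a two-stage design of this kind would save inference cost; how TSCMR actually builds its two stages is described in the full chapter, not here.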

Acknowledgments

The work reported in this paper is partially supported by NSF of Shanghai under grant number 22ZR1402000, the Fundamental Research Funds for the Central Universities under grant number 2232021A-08, State Key Laboratory of Computer Architecture (ICT,CAS) under Grant No. CARCHB 202118, Information Development Project of Shanghai Economic and Information Commission (202002009) and National Natural Science Foundation of China (No. 61906035).

Author information

Authors and Affiliations

School of Computer Science and Technology, Donghua University, Shanghai, China
Zhihao Chen & Hongya Wang
State Key Laboratory of Computer Architecture, Institute of Computing Technology, CAS, Beijing, China
Hongya Wang
Shanghai Key Laboratory of Computer Software Evaluating and Testing, Shanghai, China
Zhihao Chen & Hongya Wang

Authors

Zhihao Chen
Hongya Wang

Corresponding author

Correspondence to Hongya Wang.

Editor information

Editors and Affiliations

Northeastern University, Shenyang, China
Xiaochun Yang
The University of Indonesia, Depok, Indonesia
Heru Suhartanto
Beijing Institute of Technology, Beijing, China
Guoren Wang
Northeastern University, Shenyang, China
Bin Wang
University of Technology Sydney, Sydney, NSW, Australia
Jing Jiang
Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
Bing Li
Sun Yat-sen University, Guangzhou, China
Huaijie Zhu
Anhui University, Hefei, China
Ningning Cui

About this paper

Cite this paper

Chen, Z., Wang, H. (2023). TSCMR: Two-Stage Cross-Modal Retrieval. In: Yang, X., et al. Advanced Data Mining and Applications. ADMA 2023. Lecture Notes in Computer Science, vol 14179. Springer, Cham. https://doi.org/10.1007/978-3-031-46674-8_39

DOI: https://doi.org/10.1007/978-3-031-46674-8_39
Published: 05 November 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-46673-1
Online ISBN: 978-3-031-46674-8
eBook Packages: Computer Science, Computer Science (R0)
