TSCMR: Two-Stage Cross-Modal Retrieval

  • Conference paper
Advanced Data Mining and Applications (ADMA 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14179)

Abstract

Large-scale vision-and-language models have significantly improved performance on cross-modal retrieval tasks. However, these models require substantial computing resources, which makes them challenging to run on resource-limited devices. It is therefore important to reduce model size and computing cost without sacrificing performance. In this paper, we improve TERAN by dividing cross-modal retrieval into two stages: coarse-grained image-text matching and fine-grained image-text matching. Specifically, we present a novel approach called the Two-Stage Cross-Modal Retrieval network (TSCMR). To reduce model size after training, our approach uses a new knowledge distillation method for Transformer-based models. Experiments show that our approach maintains performance comparable to TERAN on the MS-COCO 1K test set while being 2x smaller and 3.1x faster at inference.
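
The two-stage idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify how either stage scores a pair, so everything below is an assumption made for illustration: the coarse stage is taken to be cosine similarity between one global vector per item, the fine stage sums each word's best-matching image region, and the function names, data layout, and shortlist size `k` are hypothetical rather than taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def coarse_score(query_global, image_global):
    """Stage 1 (assumed): one cheap comparison of global embeddings."""
    return cosine(query_global, image_global)


def fine_score(query_words, image_regions):
    """Stage 2 (assumed): each word takes its best-matching region; sum over words."""
    return sum(max(cosine(w, r) for r in image_regions) for w in query_words)


def two_stage_retrieve(query, images, k):
    """Shortlist k images by coarse score, then re-rank only the shortlist.

    query:  {"global": vector, "words": [vector, ...]}
    images: {image_id: {"global": vector, "regions": [vector, ...]}}
    """
    shortlist = sorted(
        images,
        key=lambda i: coarse_score(query["global"], images[i]["global"]),
        reverse=True,
    )[:k]
    return sorted(
        shortlist,
        key=lambda i: fine_score(query["words"], images[i]["regions"]),
        reverse=True,
    )
```

In this sketch the more expensive word-region comparison runs on only `k` candidates instead of the whole collection, which is where a two-stage design would be expected to save inference time. The trade-off is that an image dropped by the coarse stage cannot be recovered by the fine stage.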

Acknowledgments

The work reported in this paper is partially supported by the NSF of Shanghai under grant number 22ZR1402000, the Fundamental Research Funds for the Central Universities under grant number 2232021A-08, the State Key Laboratory of Computer Architecture (ICT, CAS) under grant number CARCHB 202118, the Information Development Project of Shanghai Economic and Information Commission (202002009), and the National Natural Science Foundation of China (No. 61906035).

Author information

Corresponding author

Correspondence to Hongya Wang.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Chen, Z., Wang, H. (2023). TSCMR: Two-Stage Cross-Modal Retrieval. In: Yang, X., et al. Advanced Data Mining and Applications. ADMA 2023. Lecture Notes in Computer Science, vol 14179. Springer, Cham. https://doi.org/10.1007/978-3-031-46674-8_39

  • DOI: https://doi.org/10.1007/978-3-031-46674-8_39

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-46673-1

  • Online ISBN: 978-3-031-46674-8

  • eBook Packages: Computer Science, Computer Science (R0)
