
TraVL: Transferring Pre-trained Visual-Linguistic Models for Cross-Lingual Image Captioning

  • Conference paper
Web and Big Data (APWeb-WAIM 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13422)


Abstract

Visual-Linguistic (VL) pre-training is attracting increasing interest because it learns generic VL representations that transfer to downstream cross-modal tasks. However, the lack of large-scale, high-quality parallel corpora makes VL pre-training impractical for low-resource languages, so it is desirable to reuse existing well-trained English VL models for cross-modal tasks in other languages. A basic transfer approach, however, fails to capture the semantic correlation between modalities and underuses the hierarchical representations of VL models. In this work, we propose TraVL, a novel framework for transferring pre-trained VL models to cross-lingual image captioning. To enforce semantic alignment during modality fusion, TraVL employs joint attention, which constructs its key-value pairs by concatenating the visual and linguistic representations. To fully exploit hierarchical visual information, we develop an adjacent layer-fusion mechanism that lets each decoder layer attend to the encoder’s multilayer representations with similar semantics. Experiments on a Chinese image-text dataset show that TraVL outperforms state-of-the-art captioning models and other transfer learning methods.
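The abstract names two architectural ideas: joint attention, whose keys and values are built by concatenating the visual and linguistic representations, and adjacent layer-fusion, where each decoder layer attends to encoder layers of similar depth rather than only the top layer. The PyTorch sketch below is a hypothetical illustration of these two ideas under assumed shapes and names (`JointAttentionLayer` and `adjacent_layer_fusion` are invented for this example); it is not the authors' implementation.

```python
# Hypothetical sketch of the two mechanisms described in the abstract.
# All names and hyperparameters are assumptions, not the authors' code.
import torch
import torch.nn as nn


class JointAttentionLayer(nn.Module):
    """Decoder layer whose keys/values concatenate visual and linguistic tokens."""

    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, text: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # Joint attention: queries come from the text stream, while the key/value
        # sequence is the concatenation of visual and linguistic representations,
        # so cross-modal and intra-modal interactions share one attention step.
        kv = torch.cat([visual, text], dim=1)
        attended, _ = self.attn(query=text, key=kv, value=kv)
        text = self.norm1(text + attended)
        return self.norm2(text + self.ffn(text))


def adjacent_layer_fusion(encoder_states, layer_idx: int, window: int = 2) -> torch.Tensor:
    """Fuse the encoder layers adjacent in depth to a given decoder layer, so the
    decoder sees visual features of a similar semantic level (averaged here as a
    simple stand-in; the paper's exact fusion may differ)."""
    lo = max(0, layer_idx - window // 2)
    hi = min(len(encoder_states), lo + window)
    return torch.stack(encoder_states[lo:hi], dim=0).mean(dim=0)


if __name__ == "__main__":
    batch, n_regions, n_words, d = 2, 36, 12, 768
    enc_states = [torch.randn(batch, n_regions, d) for _ in range(6)]  # 6 encoder layers
    text = torch.randn(batch, n_words, d)
    layer = JointAttentionLayer(d_model=d)
    visual = adjacent_layer_fusion(enc_states, layer_idx=3)  # features for decoder layer 3
    print(layer(text, visual).shape)  # torch.Size([2, 12, 768])
```

Concatenating the two modalities into a single key-value sequence lets every caption token attend to image regions and previously generated words under one softmax, which is what the abstract means by enforcing semantic alignment during fusion; the simple average in `adjacent_layer_fusion` merely stands in for whatever cross-layer weighting the paper actually uses.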


Notes

  1. https://github.com/LuoweiZhou/VLP.

  2. https://github.com/Morizeyao/GPT2-Chinese.


Acknowledgements

This work was supported by the Key Research Program of Zhejiang Province (Grant No. 2021C01109).

Author information


Corresponding author

Correspondence to Peng Lu.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Zhang, Z., Lu, P., Jiang, D., Chen, G. (2023). TraVL: Transferring Pre-trained Visual-Linguistic Models for Cross-Lingual Image Captioning. In: Li, B., Yue, L., Tao, C., Han, X., Calvanese, D., Amagasa, T. (eds) Web and Big Data. APWeb-WAIM 2022. Lecture Notes in Computer Science, vol 13422. Springer, Cham. https://doi.org/10.1007/978-3-031-25198-6_26

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-25198-6_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-25197-9

  • Online ISBN: 978-3-031-25198-6

  • eBook Packages: Computer Science, Computer Science (R0)
