Abstract
Visual-Linguistic (VL) pre-training has attracted increasing interest for its ability to learn generic VL representations that transfer to downstream cross-modal tasks. However, the lack of large-scale, high-quality parallel corpora makes VL pre-training impractical for low-resource languages, so it is desirable to leverage existing well-trained English VL models for cross-modal tasks in other languages. A naive transfer approach, however, fails to capture the semantic correlation between modalities and under-utilizes the hierarchical representations of VL models. In this work, we propose TraVL, a novel framework for transferring pre-trained VL models to cross-lingual image captioning. To enforce semantic alignment during modality fusion, TraVL employs joint attention, which constructs the key-value pairs by concatenating the visual and linguistic representations. To fully exploit the hierarchical visual information, we develop an adjacent layer-fusion mechanism that lets each decoder layer attend to the encoder's multi-layer representations with similar semantics. Experiments on a Chinese image-text dataset show that TraVL outperforms state-of-the-art captioning models and other transfer learning methods.
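To make the two mechanisms concrete, the sketch below gives a minimal PyTorch illustration of joint attention (keys and values built from the concatenation of visual and linguistic representations) and adjacent layer fusion (each decoder layer pooling the encoder layers closest to it in depth). The tensor shapes, the uniform averaging over adjacent layers, and all names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the two mechanisms described in the abstract.
# Shapes, layer pairings, and the uniform fusion weights are assumptions.
import torch
import torch.nn as nn


class JointAttention(nn.Module):
    """Cross-attention whose keys/values come from the concatenation of
    visual and linguistic representations, so the decoder query attends
    to both modalities jointly rather than to the visual stream alone."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, query, visual_feats, linguistic_feats):
        # query:            (B, T_dec, d_model)  decoder hidden states
        # visual_feats:     (B, T_img, d_model)  e.g. region features
        # linguistic_feats: (B, T_txt, d_model)  e.g. caption-prefix states
        kv = torch.cat([visual_feats, linguistic_feats], dim=1)  # joint key/value
        out, _ = self.attn(query, kv, kv)
        return out


def adjacent_layer_fusion(decoder_layer_idx, encoder_layer_outputs, window=2):
    """Fuse the encoder layers adjacent in depth to a given decoder layer,
    so each decoder layer sees encoder representations of a similar semantic
    level. Uniform averaging is an assumption; the fusion could be learned."""
    lo = max(0, decoder_layer_idx - window + 1)
    hi = min(len(encoder_layer_outputs), decoder_layer_idx + 1)
    selected = encoder_layer_outputs[lo:hi]          # list of (B, T_img, d_model)
    return torch.stack(selected, dim=0).mean(dim=0)  # (B, T_img, d_model)


# Example usage (hypothetical sizes: 36 image regions, 10 prefix tokens, d_model=512):
# fused = adjacent_layer_fusion(3, encoder_layer_outputs)
# out = JointAttention(512, 8)(decoder_states, fused, linguistic_states)
```

In a decoder block, the fused encoder output produced by adjacent_layer_fusion would serve as the visual input to JointAttention, alongside the linguistic states of the caption prefix.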
Acknowledgements
This work was supported by the Key Research Program of Zhejiang Province (Grant No. 2021C01109).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, Z., Lu, P., Jiang, D., Chen, G. (2023). TraVL: Transferring Pre-trained Visual-Linguistic Models for Cross-Lingual Image Captioning. In: Li, B., Yue, L., Tao, C., Han, X., Calvanese, D., Amagasa, T. (eds) Web and Big Data. APWeb-WAIM 2022. Lecture Notes in Computer Science, vol 13422. Springer, Cham. https://doi.org/10.1007/978-3-031-25198-6_26
DOI: https://doi.org/10.1007/978-3-031-25198-6_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-25197-9
Online ISBN: 978-3-031-25198-6