Abstract
Visual-Linguistic (VL) pre-training has attracted increasing interest for its ability to learn generic VL representations that transfer to downstream cross-modal tasks. However, the lack of large-scale, high-quality parallel corpora makes VL pre-training impractical for low-resource languages, so it is desirable to leverage existing well-trained English VL models for cross-modal tasks in other languages. A naive transfer approach, however, fails to capture the semantic correlation between modalities and under-utilizes the hierarchical representations of VL models. In this work, we propose TraVL, a novel framework for transferring pre-trained VL models to cross-lingual image captioning. To enforce semantic alignment during modality fusion, TraVL employs joint attention, which constructs the key-value pairs by concatenating the visual and linguistic representations. To fully exploit the hierarchical visual information, we develop an adjacent layer-fusion mechanism that lets each decoder layer attend to the encoder's multi-layer representations with similar semantics. Experiments on a Chinese image-text dataset show that TraVL outperforms state-of-the-art captioning models and other transfer learning methods.
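To make the two mechanisms concrete, the sketch below gives a minimal PyTorch illustration of joint attention (keys and values built from the concatenation of visual and linguistic representations) and adjacent layer fusion (each decoder layer pooling the encoder layers closest to it in depth). The tensor shapes, the uniform averaging over adjacent layers, and all names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the two mechanisms described in the abstract.
# Shapes, layer pairings, and the uniform fusion weights are assumptions.
import torch
import torch.nn as nn


class JointAttention(nn.Module):
    """Cross-attention whose keys/values come from the concatenation of
    visual and linguistic representations, so the decoder query attends
    to both modalities jointly rather than to the visual stream alone."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, query, visual_feats, linguistic_feats):
        # query:            (B, T_dec, d_model)  decoder hidden states
        # visual_feats:     (B, T_img, d_model)  e.g. region features
        # linguistic_feats: (B, T_txt, d_model)  e.g. caption-prefix states
        kv = torch.cat([visual_feats, linguistic_feats], dim=1)  # joint key/value
        out, _ = self.attn(query, kv, kv)
        return out


def adjacent_layer_fusion(decoder_layer_idx, encoder_layer_outputs, window=2):
    """Fuse the encoder layers adjacent in depth to a given decoder layer,
    so each decoder layer sees encoder representations of a similar semantic
    level. Uniform averaging is an assumption; the fusion could be learned."""
    lo = max(0, decoder_layer_idx - window + 1)
    hi = min(len(encoder_layer_outputs), decoder_layer_idx + 1)
    selected = encoder_layer_outputs[lo:hi]          # list of (B, T_img, d_model)
    return torch.stack(selected, dim=0).mean(dim=0)  # (B, T_img, d_model)


# Example usage (hypothetical sizes: 36 image regions, 10 prefix tokens, d_model=512):
# fused = adjacent_layer_fusion(3, encoder_layer_outputs)
# out = JointAttention(512, 8)(decoder_states, fused, linguistic_states)
```

In a decoder block, the fused encoder output produced by adjacent_layer_fusion would serve as the visual input to JointAttention, alongside the linguistic states of the caption prefix.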
Acknowledgements
This work was supported by the Key Research Program of Zhejiang Province (Grant No. 2021C01109).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, Z., Lu, P., Jiang, D., Chen, G. (2023). TraVL: Transferring Pre-trained Visual-Linguistic Models for Cross-Lingual Image Captioning. In: Li, B., Yue, L., Tao, C., Han, X., Calvanese, D., Amagasa, T. (eds) Web and Big Data. APWeb-WAIM 2022. Lecture Notes in Computer Science, vol 13422. Springer, Cham. https://doi.org/10.1007/978-3-031-25198-6_26
DOI: https://doi.org/10.1007/978-3-031-25198-6_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-25197-9
Online ISBN: 978-3-031-25198-6