
Conditional Embedding Pre-Training Language Model for Image Captioning


Abstract

Pre-trained language models not only learn language representations at different granularities from large corpora, but also provide a good initialization for downstream tasks. The mainstream approach to vision-language tasks is to aggregate or align textual and visual features and feed them to a pre-trained language model as input. People, however, describe an image accurately by repeatedly referring to its visual content and key textual information. Inspired by this observation, we depart from the mainstream approach and instead adjust the processing of the pre-trained language model by using high- and low-level visual features as conditional inputs. Specifically, we propose conditional embedding layer normalization (CELN), an effective mechanism for embedding visual features into pre-trained language models for feature selection. We apply CELN to the transformer layers of the unified pre-trained language model (UNILM). This parameter-modulation mechanism is a novel direction for pre-trained language models. Extensive experiments on two challenging benchmarks (the MSCOCO and Visual Genome datasets) demonstrate the effectiveness of the approach. Code and models are publicly available at: https://github.com/lpfworld/CE-UNILM.
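To make the conditioning idea concrete, the following is a minimal PyTorch sketch of a conditional layer normalization in the spirit of CELN, in which the scale and shift of a standard layer norm are modulated by a pooled visual feature. The class name, the projections to_gamma/to_beta, and the tensor shapes are illustrative assumptions rather than the authors' implementation; see the linked repository for the actual code.

```python
import torch
import torch.nn as nn


class ConditionalEmbeddingLayerNorm(nn.Module):
    """Layer normalization whose affine parameters are conditioned on a
    visual feature vector (a sketch of the CELN idea, not the paper's code)."""

    def __init__(self, hidden_size: int, visual_size: int, eps: float = 1e-12):
        super().__init__()
        # Base (unconditional) LayerNorm parameters, as in a standard transformer.
        self.gamma = nn.Parameter(torch.ones(hidden_size))
        self.beta = nn.Parameter(torch.zeros(hidden_size))
        self.eps = eps
        # Hypothetical projections mapping the visual feature to per-dimension
        # offsets for the scale and shift.
        self.to_gamma = nn.Linear(visual_size, hidden_size)
        self.to_beta = nn.Linear(visual_size, hidden_size)

    def forward(self, hidden: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, hidden_size) token representations
        # visual: (batch, visual_size) pooled visual feature
        mean = hidden.mean(dim=-1, keepdim=True)
        var = hidden.var(dim=-1, unbiased=False, keepdim=True)
        normalized = (hidden - mean) / torch.sqrt(var + self.eps)
        # Condition the affine transform on the visual feature and broadcast
        # it over the sequence dimension.
        gamma = self.gamma + self.to_gamma(visual).unsqueeze(1)
        beta = self.beta + self.to_beta(visual).unsqueeze(1)
        return gamma * normalized + beta
```

Under these assumptions, such a layer could replace the standard layer normalization inside each transformer block so that the same language model weights are re-used while the visual condition steers which features are amplified or suppressed.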



Acknowledgements

This work is supported by the Zhejiang Provincial Technical Plan Project (Nos. 2020C03105 and 2021C01129).

Author information


Corresponding author

Correspondence to Min Zhang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Li, P., Zhang, M., Lin, P. et al. Conditional Embedding Pre-Training Language Model for Image Captioning. Neural Process Lett 54, 4987–5003 (2022). https://doi.org/10.1007/s11063-022-10844-3

