
Conditional Embedding Pre-Training Language Model for Image Captioning


Abstract

Pre-trained language models not only learn language representations at different granularities from large corpora, but also provide a good initialization for downstream tasks. The mainstream approach to vision-language tasks is to aggregate or align textual and visual features and feed them to a pre-trained language model as input. People, however, describe an image accurately by repeatedly referring to its visual content and key textual information. Inspired by this observation, we depart from the mainstream approach and instead adjust the processing of the pre-trained language model by using high- and low-level visual features as conditional inputs. Specifically, we propose conditional embedding layer normalization (CELN), an effective mechanism for embedding visual features into pre-trained language models for feature selection. We apply CELN to the transformer layers of the unified pre-trained language model (UNILM). This parameter-modulation mechanism is a novel direction for pre-trained language models. Extensive experiments on two challenging benchmarks (the MSCOCO and Visual Genome datasets) demonstrate the effectiveness of the approach. Code and models are publicly available at: https://github.com/lpfworld/CE-UNILM.
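To make the conditioning idea concrete, the following is a minimal PyTorch sketch of a conditional layer normalization in the spirit of CELN, in which the scale and shift of a standard layer norm are modulated by a pooled visual feature. The class name, the projections to_gamma/to_beta, and the tensor shapes are illustrative assumptions rather than the authors' implementation; see the linked repository for the actual code.

```python
import torch
import torch.nn as nn


class ConditionalEmbeddingLayerNorm(nn.Module):
    """Layer normalization whose affine parameters are conditioned on a
    visual feature vector (a sketch of the CELN idea, not the paper's code)."""

    def __init__(self, hidden_size: int, visual_size: int, eps: float = 1e-12):
        super().__init__()
        # Base (unconditional) LayerNorm parameters, as in a standard transformer.
        self.gamma = nn.Parameter(torch.ones(hidden_size))
        self.beta = nn.Parameter(torch.zeros(hidden_size))
        self.eps = eps
        # Hypothetical projections mapping the visual feature to per-dimension
        # offsets for the scale and shift.
        self.to_gamma = nn.Linear(visual_size, hidden_size)
        self.to_beta = nn.Linear(visual_size, hidden_size)

    def forward(self, hidden: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, hidden_size) token representations
        # visual: (batch, visual_size) pooled visual feature
        mean = hidden.mean(dim=-1, keepdim=True)
        var = hidden.var(dim=-1, unbiased=False, keepdim=True)
        normalized = (hidden - mean) / torch.sqrt(var + self.eps)
        # Condition the affine transform on the visual feature and broadcast
        # it over the sequence dimension.
        gamma = self.gamma + self.to_gamma(visual).unsqueeze(1)
        beta = self.beta + self.to_beta(visual).unsqueeze(1)
        return gamma * normalized + beta
```

Under these assumptions, such a layer could replace the standard layer normalization inside each transformer block so that the same language model weights are re-used while the visual condition steers which features are amplified or suppressed.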



Acknowledgements

This work is supported by the Zhejiang Provincial Technical Plan Project (Nos. 2020C03105 and 2021C01129).

Author information


Corresponding author

Correspondence to Min Zhang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Li, P., Zhang, M., Lin, P. et al. Conditional Embedding Pre-Training Language Model for Image Captioning. Neural Process Lett 54, 4987–5003 (2022). https://doi.org/10.1007/s11063-022-10844-3

