Abstract
Image captioning aims to generate a description of an image using techniques from computer vision and natural language processing, where a framework of Convolutional Neural Networks (CNN) followed by Recurrent Neural Networks (RNN), particularly LSTM, is widely used. In recent years, attention-based CNN-LSTM networks have attained significant progress due to their ability to model global context. However, CNN-LSTMs do not explicitly consider the linguistic context, which is very useful for further boosting performance. To overcome this issue, we propose a method that integrates an n-gram model into the attention-based image captioning framework, modelling the word transition probability during decoding to enhance the linguistic context of the generated captions. We evaluated the method with BLEU and METEOR on the MSCOCO 2014 benchmark dataset. Experimental results show the effectiveness of the proposed method. Specifically, BLEU-1, BLEU-2, BLEU-3, BLEU-4, and METEOR are improved by 0.2%, 0.7%, 0.6%, 0.5%, and 0.1, respectively.
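The decoding step described in the abstract can be pictured as a fusion of the decoder's visual score with an n-gram transition score at each step. The sketch below illustrates one plausible reading, assuming a bigram model estimated on the training captions and a log-linear mixing weight lam; the function names (lstm_step, ngram_logprob) and the exact fusion rule are illustrative assumptions, not the paper's actual API.

```python
# Minimal sketch of fusing an n-gram language model into greedy decoding.
# All names (lstm_step, ngram_logprob, lam) are illustrative assumptions;
# the paper's exact fusion rule may differ.
import numpy as np

def decode_with_ngram(lstm_step, ngram_logprob, vocab, bos, eos,
                      lam=0.3, max_len=20):
    """Greedy decoding that rescores each step with n-gram transitions.

    lstm_step(prefix) -> log-probabilities over the vocabulary (np.ndarray,
        same order as vocab), produced by the attention-based CNN-LSTM.
    ngram_logprob(prev_word, word) -> log P(word | prev_word) from a bigram
        model estimated on the training captions (assumption).
    lam weights the linguistic context against the visual decoder.
    """
    prefix = [bos]
    for _ in range(max_len):
        visual_logp = lstm_step(prefix)                    # CNN-LSTM score
        lm_logp = np.array([ngram_logprob(prefix[-1], w)   # n-gram score
                            for w in vocab])
        fused = (1.0 - lam) * visual_logp + lam * lm_logp  # log-linear mix
        next_word = vocab[int(np.argmax(fused))]
        if next_word == eos:
            break
        prefix.append(next_word)
    return prefix[1:]
```

The same fused score could equally be applied to each hypothesis extension in a beam search; greedy decoding is used here only to keep the sketch short.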
Notes
- 1. We obtained a higher CIDEr score, but no results were reported for the other two methods in the literature.
References
Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: International Conference on Machine Learning, pp. 2048–2057 (2015)
Farhadi, A., et al.: Every picture tells a story: generating sentences from images. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6314, pp. 15–29. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15561-1_2
Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164 (2015)
Lu, J., Xiong, C., Parikh, D., Socher, R.: Knowing when to look: adaptive attention via a visual sentinel for image captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 375–383 (2017)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587 (2014)
Han, J., Zhang, D., Hu, X., et al.: Background prior-based salient object detection via deep reconstruction residual. IEEE Trans. Circ. Syst. Video Technol. 25(8), 1309–1321 (2014)
Yang, X., Huang, K., Zhang, R., et al.: A novel deep density model for unsupervised learning. Cogn. Comput. 11(6), 778–788 (2018)
Wang, Z., Ren, J., Zhang, D., et al.: A deep-learning based feature hybrid framework for spatiotemporal saliency detection inside videos. Neurocomputing 287, 68–83 (2018)
Chen, H., Ding, G., Lin, Z., Guo, Y., Han, J.: Attend to knowledge: memory-enhanced attention network for image captioning. In: Ren, J., et al. (eds.) BICS 2018. LNCS (LNAI), vol. 10989, pp. 161–171. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00563-4_16
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Barnard, K., Duygulu, P., Forsyth, D., de Freitas, N., Blei, D.M., Jordan, M.I.: Matching words and pictures. J. Mach. Learn. Res. 3(Feb), 1107–1135 (2003)
Kulkarni, G., Premraj, V., Ordonez, V., et al.: BabyTalk: understanding and generating simple image descriptions. IEEE Trans. Pattern Anal. Mach. Intell. 35(12), 2891–2903 (2013)
Yang, Y., Teo, C.L., Daumé III, H., Aloimonos, Y.: Corpus-guided sentence generation of natural images. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 444–454. Association for Computational Linguistics, July 2011
Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: The IEEE Conference on Computer Vision and Pattern Recognition, June 2015
Yan, Y., Ren, J., Sun, G., et al.: Unsupervised image saliency detection with Gestalt-laws guided optimization and visual attention based refinement. Pattern Recogn. 79, 65–78 (2018)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
LeCun, Y., Bottou, L., Bengio, Y., et al.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
Ellis, N.C.: Frequency effects in language processing: a review with implications for theories of implicit and explicit language acquisition. Stud. Second. Lang. Acquis. 24(2), 143–188 (2002)
Wang, Q.-F., Yin, F., Liu, C.-L.: Integrating language model in handwritten Chinese text recognition. In: Proceedings of the 10th ICDAR, pp. 1036–1040 (2009)
Wang, Q.F., Yin, F., Liu, C.L.: Handwritten Chinese text recognition by integrating multiple contexts. IEEE Trans. Pattern Anal. Mach. Intell. 34(8), 1469–1481 (2011)
Dong, H., Wang, W., Huang, K., et al.: Joint multi-label attention networks for social text annotation. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 1348–1354 (2019)
Jurafsky, D., Martin, J.H.: Speech and Language Processing, 2nd edn. Pearson Prentice Hall, Upper Saddle River (2008)
Goodman, J.T.: A bit of progress in language modeling: extended version. Technical report MSR-TR-2001-72, Microsoft Research (2001)
Turian, J., Ratinov, L., Bengio, Y.: Word representations: a simple and general method for semi-supervised learning. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pp. 384–394 (2010)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Banerjee, S., Lavie, A.: METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72 (2005)
Vedantam, R., Lawrence Zitnick, C., Parikh, D.: CIDEr: consensus-based image description evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–4575 (2015)
Devlin, J., Chang, M.W., Lee, K., et al.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Acknowledgements
The work was partially supported by the following: CCF-Tencent Open Research Fund RAGR20180109; National Natural Science Foundation of China under nos. 61876155 and 61876154; Natural Science Foundation of the Jiangsu Higher Education Institutions of China under no. 17KJD520010; Suzhou Science and Technology Program under nos. SYG201712 and SZS201613; Natural Science Foundation of Jiangsu Province under nos. BK20181189 and BK20181190; Key Program Special Fund in XJTLU under nos. KSF-A-01, KSF-P-02, KSF-E-26, and KSF-A-10; XJTLU Research Development Fund RDF-16-02-49.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Cao, Y., Wang, Q.F., Huang, K., Zhang, R. (2020). Improving Image Caption Performance with Linguistic Context. In: Ren, J., et al. (eds.) Advances in Brain Inspired Cognitive Systems. BICS 2019. Lecture Notes in Computer Science, vol. 11691. Springer, Cham. https://doi.org/10.1007/978-3-030-39431-8_1
DOI: https://doi.org/10.1007/978-3-030-39431-8_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-39430-1
Online ISBN: 978-3-030-39431-8
eBook Packages: Computer Science, Computer Science (R0)