DOI: 10.1145/3126686.3126717
Research Article | Public Access

Watch What You Just Said: Image Captioning with Text-Conditional Attention

Published: 23 October 2017

Abstract

Attention mechanisms have attracted considerable interest in image captioning because of their strong performance. However, existing methods compute attention from visual content alone, and whether textual context can improve attention in image captioning remains an open question. To explore this problem, we propose a novel attention mechanism, called text-conditional attention, which allows the caption generator to focus on certain image features given the previously generated text. To obtain text-related image features for our attention model, we adopt the guiding Long Short-Term Memory (gLSTM) captioning architecture with CNN fine-tuning. Our method allows joint learning of the image embedding, text embedding, text-conditional attention, and language model within one network architecture in an end-to-end manner. We perform extensive experiments on the MS-COCO dataset. The results show that our method outperforms state-of-the-art captioning methods on various quantitative metrics as well as in human evaluation, which supports the use of text-conditional attention in image captioning.
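As a concrete illustration of the idea described in the abstract, the sketch below shows one possible way a decoder step could condition its view of the image on the previously generated word: the image embedding is re-weighted by a gate computed from the previous word's embedding before being fed to the guiding LSTM. This is a minimal PyTorch sketch under stated assumptions; the class name, dimensions, and the sigmoid-gating form are illustrative guesses based only on the abstract, not the paper's actual equations.

    # Illustrative sketch of one text-conditionally attended decoding step.
    # The gating form, module names, and dimensions are assumptions drawn
    # from the abstract, not the paper's exact formulation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TextConditionalCaptioner(nn.Module):
        def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512, vocab_size=10000):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)    # text embedding
            self.img_proj = nn.Linear(feat_dim, embed_dim)      # image embedding
            self.gate = nn.Linear(embed_dim, embed_dim)         # text-conditional gate (assumed form)
            self.lstm = nn.LSTMCell(2 * embed_dim, hidden_dim)  # guided LSTM decoder step
            self.out = nn.Linear(hidden_dim, vocab_size)

        def step(self, img_feat, prev_word, state=None):
            """One decoding step: re-weight image features using the previous word."""
            w = self.embed(prev_word)              # (B, embed_dim)
            v = self.img_proj(img_feat)            # (B, embed_dim)
            attn = torch.sigmoid(self.gate(w))     # text-conditional weights in [0, 1]
            guided = attn * v                      # image features focused by the generated text
            h, c = self.lstm(torch.cat([w, guided], dim=1), state)
            return F.log_softmax(self.out(h), dim=1), (h, c)

    # Example: one step on a batch of 2 images with hypothetical previous words.
    model = TextConditionalCaptioner()
    log_probs, state = model.step(torch.randn(2, 2048), torch.tensor([1, 5]))

In this reading, the gate plays the role of text-conditional attention over the image embedding, and the concatenated word-plus-guided-image vector plays the role of the gLSTM guidance; the actual paper trains the image embedding, text embedding, attention, and language model jointly end-to-end with a fine-tuned CNN, which the sketch above omits.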




Published In

Thematic Workshops '17: Proceedings of the on Thematic Workshops of ACM Multimedia 2017
October 2017
558 pages
ISBN:9781450354165
DOI:10.1145/3126686
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. image captioning
  2. lstm
  3. multi-modal embedding
  4. neural network

Qualifiers

  • Research-article

Funding Sources

  • ARO
  • DARPA
  • NSF NRI

Conference

MM '17
Sponsor:
MM '17: ACM Multimedia Conference
October 23 - 27, 2017
Mountain View, California, USA


Article Metrics

  • Downloads (last 12 months): 122
  • Downloads (last 6 weeks): 16
Reflects downloads up to 20 Jan 2025

Cited By

  • (2025) Automated Image Caption Generator for Visually Impaired Using VGG16 and LSTM. Advances in Data and Information Sciences, 10.1007/978-981-97-7360-2_11, 109-119. Online publication date: 3-Jan-2025.
  • (2024) Align and Retrieve: Composition and Decomposition Learning in Image Retrieval With Text Feedback. IEEE Transactions on Multimedia, 10.1109/TMM.2024.3417694, 26, 9936-9948. Online publication date: 1-Jan-2024.
  • (2024) Brush2Prompt: Contextual Prompt Generator for Object Inpainting. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10.1109/CVPR52733.2024.01201, 12636-12645. Online publication date: 16-Jun-2024.
  • (2024) A Survey on Automatic Image Captioning Approaches: Contemporary Trends and Future Perspectives. Archives of Computational Methods in Engineering, 10.1007/s11831-024-10190-8. Online publication date: 16-Oct-2024.
  • (2024) A Survey on Adversarial Text Attacks on Deep Learning Models in Natural Language Processing. Proceedings of the 5th International Conference on Data Science, Machine Learning and Applications, Volume 1, 10.1007/978-981-97-8031-0_111, 1059-1067. Online publication date: 6-Oct-2024.
  • (2023) Deep Learning Approaches on Image Captioning: A Review. ACM Computing Surveys, 10.1145/3617592, 56:3, 1-39. Online publication date: 5-Oct-2023.
  • (2023) Nested Attention Network with Graph Filtering for Visual Question and Answering. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 10.1109/ICASSP49357.2023.10096849, 1-5. Online publication date: 4-Jun-2023.
  • (2023) Image captioning based on scene graphs: A survey. Expert Systems with Applications, 10.1016/j.eswa.2023.120698, 231, 120698. Online publication date: Nov-2023.
  • (2023) Image captioning using transformer-based double attention network. Engineering Applications of Artificial Intelligence, 10.1016/j.engappai.2023.106545, 125, 106545. Online publication date: Oct-2023.
  • (2023) From methods to datasets: A survey on Image-Caption Generators. Multimedia Tools and Applications, 10.1007/s11042-023-16560-x, 83:9, 28077-28123. Online publication date: 31-Aug-2023.
