DOI: 10.1145/3126686.3126717
Research Article | Public Access

Watch What You Just Said: Image Captioning with Text-Conditional Attention

Published: 23 October 2017

Abstract

Attention mechanisms have attracted considerable interest in image captioning because of their strong performance. However, existing methods compute attention from visual content alone, and whether textual context can improve attention in image captioning remains an open question. To explore this problem, we propose a novel attention mechanism, called text-conditional attention, which allows the caption generator to focus on certain image features given the previously generated text. To obtain text-related image features for our attention model, we adopt the guiding Long Short-Term Memory (gLSTM) captioning architecture with CNN fine-tuning. Our method allows joint learning of the image embedding, text embedding, text-conditional attention, and language model within one network architecture in an end-to-end manner. We perform extensive experiments on the MS-COCO dataset. The results show that our method outperforms state-of-the-art captioning methods on various quantitative metrics as well as in human evaluation, which supports the use of text-conditional attention in image captioning.
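As a concrete illustration of the idea described in the abstract, the sketch below shows one possible way a decoder step could condition its view of the image on the previously generated word: the image embedding is re-weighted by a gate computed from the previous word's embedding before being fed to the guiding LSTM. This is a minimal PyTorch sketch under stated assumptions; the class name, dimensions, and the sigmoid-gating form are illustrative guesses based only on the abstract, not the paper's actual equations.

    # Illustrative sketch of one text-conditionally attended decoding step.
    # The gating form, module names, and dimensions are assumptions drawn
    # from the abstract, not the paper's exact formulation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TextConditionalCaptioner(nn.Module):
        def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512, vocab_size=10000):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)    # text embedding
            self.img_proj = nn.Linear(feat_dim, embed_dim)      # image embedding
            self.gate = nn.Linear(embed_dim, embed_dim)         # text-conditional gate (assumed form)
            self.lstm = nn.LSTMCell(2 * embed_dim, hidden_dim)  # guided LSTM decoder step
            self.out = nn.Linear(hidden_dim, vocab_size)

        def step(self, img_feat, prev_word, state=None):
            """One decoding step: re-weight image features using the previous word."""
            w = self.embed(prev_word)              # (B, embed_dim)
            v = self.img_proj(img_feat)            # (B, embed_dim)
            attn = torch.sigmoid(self.gate(w))     # text-conditional weights in [0, 1]
            guided = attn * v                      # image features focused by the generated text
            h, c = self.lstm(torch.cat([w, guided], dim=1), state)
            return F.log_softmax(self.out(h), dim=1), (h, c)

    # Example: one step on a batch of 2 images with hypothetical previous words.
    model = TextConditionalCaptioner()
    log_probs, state = model.step(torch.randn(2, 2048), torch.tensor([1, 5]))

In this reading, the gate plays the role of text-conditional attention over the image embedding, and the concatenated word-plus-guided-image vector plays the role of the gLSTM guidance; the actual paper trains the image embedding, text embedding, attention, and language model jointly end-to-end with a fine-tuned CNN, which the sketch above omits.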




Published In

Thematic Workshops '17: Proceedings of the on Thematic Workshops of ACM Multimedia 2017
October 2017
558 pages
ISBN:9781450354165
DOI:10.1145/3126686
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. image captioning
  2. lstm
  3. multi-modal embedding
  4. neural network

Qualifiers

  • Research-article

Funding Sources

  • ARO
  • DARPA
  • NSF NRI

Conference

MM '17
Sponsor:
MM '17: ACM Multimedia Conference
October 23 - 27, 2017
Mountain View, California, USA


Article Metrics

  • Downloads (last 12 months): 122
  • Downloads (last 6 weeks): 16
Reflects downloads up to 20 Jan 2025

Cited By

  • (2025) Automated Image Caption Generator for Visually Impaired Using VGG16 and LSTM. Advances in Data and Information Sciences, 10.1007/978-981-97-7360-2_11, 109-119. Online publication date: 3-Jan-2025.
  • (2024) Align and Retrieve: Composition and Decomposition Learning in Image Retrieval With Text Feedback. IEEE Transactions on Multimedia, 10.1109/TMM.2024.3417694, 26, 9936-9948. Online publication date: 1-Jan-2024.
  • (2024) Brush2Prompt: Contextual Prompt Generator for Object Inpainting. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10.1109/CVPR52733.2024.01201, 12636-12645. Online publication date: 16-Jun-2024.
  • (2024) A Survey on Automatic Image Captioning Approaches: Contemporary Trends and Future Perspectives. Archives of Computational Methods in Engineering, 10.1007/s11831-024-10190-8. Online publication date: 16-Oct-2024.
  • (2024) A Survey on Adversarial Text Attacks on Deep Learning Models in Natural Language Processing. Proceedings of the 5th International Conference on Data Science, Machine Learning and Applications, Volume 1, 10.1007/978-981-97-8031-0_111, 1059-1067. Online publication date: 6-Oct-2024.
  • (2023) Deep Learning Approaches on Image Captioning: A Review. ACM Computing Surveys, 10.1145/3617592, 56:3, 1-39. Online publication date: 5-Oct-2023.
  • (2023) Nested Attention Network with Graph Filtering for Visual Question and Answering. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 10.1109/ICASSP49357.2023.10096849, 1-5. Online publication date: 4-Jun-2023.
  • (2023) Image captioning based on scene graphs: A survey. Expert Systems with Applications, 10.1016/j.eswa.2023.120698, 231, 120698. Online publication date: Nov-2023.
  • (2023) Image captioning using transformer-based double attention network. Engineering Applications of Artificial Intelligence, 10.1016/j.engappai.2023.106545, 125, 106545. Online publication date: Oct-2023.
  • (2023) From methods to datasets: A survey on Image-Caption Generators. Multimedia Tools and Applications, 10.1007/s11042-023-16560-x, 83:9, 28077-28123. Online publication date: 31-Aug-2023.
