DOI: 10.1145/2911996.2930060
ICMR Conference Proceedings · Short paper

Introducing Concept And Syntax Transition Networks for Image Captioning

Published: 06 June 2016

Abstract

The area of image captioning, i.e., the automatic generation of short textual descriptions of images, has experienced much progress recently. However, image captioning approaches often focus solely on describing the factual content of an image, omitting the emotional or sentimental dimension that is common in human-written captions. This paper presents an image captioning approach designed specifically to incorporate emotions and feelings into the caption generation process. The presented approach consists of a Deep Convolutional Neural Network (CNN) for detecting Adjective Noun Pairs in the image, and a novel graphical network architecture, the "Concept And Syntax Transition (CAST)" network, which generates sentences from these detected concepts.
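The abstract gives only the high-level pipeline, so the following is a minimal, hypothetical Python sketch of the two-stage idea: a stubbed ANP detector feeding a tiny hand-written transition graph. The detect_anps stub, the TRANSITIONS graph, and every score and weight below are invented for illustration; this is not the authors' CAST model, which is a learned network whose details are not given on this page.

```python
# Hypothetical two-stage sketch of the pipeline the abstract describes:
# (1) a CNN detects Adjective Noun Pairs (ANPs) in an image,
# (2) a concept/syntax transition graph is walked to realize a sentence.
# All names, scores, and graph weights are invented for illustration;
# the actual CAST network is a learned graphical model.

def detect_anps(image_path):
    """Stand-in for a CNN-based ANP detector (DeepSentiBank-style).
    A real detector would score thousands of ANPs; we return two."""
    return {"happy dog": 0.81, "sunny beach": 0.64}  # invented confidences

# Toy transition network: nodes are syntax tokens or ANP concepts,
# edges carry hand-picked transition probabilities.
TRANSITIONS = {
    "<start>": [("a", 1.0)],
    "a": [("happy dog", 0.6), ("sunny beach", 0.4)],
    "happy dog": [("on", 0.7), ("<end>", 0.3)],
    "on": [("a", 1.0)],
    "sunny beach": [("<end>", 1.0)],
}

def generate_caption(anps, max_steps=10):
    """Greedy graph walk that boosts edges leading to detected ANPs
    and consumes each ANP once it has been emitted."""
    anps = dict(anps)  # local copy so we can consume entries
    state, words = "<start>", []
    for _ in range(max_steps):
        choices = TRANSITIONS.get(state, [("<end>", 1.0)])
        # Re-weight each edge by the detection confidence of its target.
        scored = [(t, p * (1.0 + anps.get(t, 0.0))) for t, p in choices]
        state = max(scored, key=lambda tp: tp[1])[0]
        if state == "<end>":
            break
        anps.pop(state, None)  # an emitted concept is used up
        words.append(state)
    return " ".join(words)

if __name__ == "__main__":
    print(generate_caption(detect_anps("beach.jpg")))
    # -> "a happy dog on a sunny beach"
```

The greedy walk is only one possible decoding; a learned CAST network would presumably use trained transition probabilities and a richer syntax state than this hand-written graph.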

Published In

ICMR '16: Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval
June 2016
452 pages
ISBN: 9781450343596
DOI: 10.1145/2911996

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. auto caption
  2. image captioning

Qualifiers

  • Short-paper

Conference

ICMR '16: International Conference on Multimedia Retrieval
June 6 - 9, 2016
New York, New York, USA

Acceptance Rates

ICMR '16 paper acceptance rate: 20 of 120 submissions (17%)
Overall acceptance rate: 254 of 830 submissions (31%)

Cited By

  • (2021) Integrating Historical States and Co-attention Mechanism for Visual Dialog. 2020 25th International Conference on Pattern Recognition (ICPR), pp. 2041-2048. DOI: 10.1109/ICPR48806.2021.9412629. Online publication date: 10-Jan-2021.
  • (2019) A Survey on Deep Learning in Image Polarity Detection: Balancing Generalization Performances and Computational Costs. Electronics, 8(7):783. DOI: 10.3390/electronics8070783. Online publication date: 12-Jul-2019.
  • (2019) A State-of-Art Review on Automatic Video Annotation Techniques. Intelligent Systems Design and Applications, pp. 1060-1069. DOI: 10.1007/978-3-030-16657-1_99. Online publication date: 12-Apr-2019.
  • (2018) A Survey on Automatic Image Captioning. Mathematics and Computing, pp. 74-83. DOI: 10.1007/978-981-13-0023-3_8. Online publication date: 14-Apr-2018.
  • (2017) Social Multimedia Sentiment Analysis. Proceedings of the 25th ACM International Conference on Multimedia, pp. 1953-1954. DOI: 10.1145/3123266.3130143. Online publication date: 23-Oct-2017.
  • (2016) Generating Affective Captions using Concept And Syntax Transition Networks. Proceedings of the 24th ACM International Conference on Multimedia, pp. 1111-1115. DOI: 10.1145/2964284.2984070. Online publication date: 1-Oct-2016.
