DOI: 10.1145/2733373.2806338

Rich Image Description Based on Regions

Published: 13 October 2015

Abstract

Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In contrast to previous image description methods that focus on describing the whole image, this paper presents a method for generating rich image descriptions from image regions. First, we detect regions with the R-CNN (regions with convolutional neural network features) framework. We then use an RNN (recurrent neural network) to generate a sentence for each image region. Finally, we propose an optimization method to select the most suitable region. The proposed model generates several sentence descriptions of regions in an image, which together have sufficient representative power for the whole image and contain more detailed information. Compared with a general image-level description, generating more specific and accurate sentences for different regions can better satisfy the personal requirements of different users. Experimental evaluations validate the effectiveness of the proposed method.
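The pipeline in the abstract ends with an optimization step that selects one suitable region from the candidates. The Python sketch below illustrates only that final selection step on toy data; the `select_region` function, the `confidence`/`area_ratio` fields, and the weighting scheme are illustrative assumptions, not the paper's actual objective, which is not reproduced on this page.

```python
# Illustrative sketch of a region-selection step like the one described in
# the abstract. The scoring function is a hypothetical stand-in: it trades
# off detection confidence against relative region area as a rough proxy
# for how representative a region is of the whole image.

def select_region(regions, alpha=0.5):
    """Return the region maximizing a weighted sum of detection
    confidence and area ratio (region area / image area)."""
    def score(r):
        return alpha * r["confidence"] + (1 - alpha) * r["area_ratio"]
    return max(regions, key=score)

# Toy region proposals, e.g. as R-CNN might produce after scoring.
regions = [
    {"name": "dog",     "confidence": 0.92, "area_ratio": 0.40},
    {"name": "frisbee", "confidence": 0.85, "area_ratio": 0.05},
    {"name": "grass",   "confidence": 0.60, "area_ratio": 0.65},
]

best = select_region(regions)
print(best["name"])  # prints: dog
```

With `alpha` closer to 0 the criterion favors large regions (here, "grass"); closer to 1 it favors confident detections. Any real implementation would score regions with a learned objective rather than this hand-set trade-off.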


Cited By

  • (2019) Neighbouring Relationship Exploration Based on Graph Convolutional Network for Object Detection. 2019 IEEE International Conference on Unmanned Systems and Artificial Intelligence (ICUSAI), pages 178--183. DOI: 10.1109/ICUSAI47366.2019.9124840
  • (2019) Attention-gated LSTM for Image Captioning. 2019 IEEE International Conference on Unmanned Systems and Artificial Intelligence (ICUSAI), pages 172--177. DOI: 10.1109/ICUSAI47366.2019.9124779
  • (2017) VideoWhisper: Toward Discriminative Unsupervised Video Feature Learning With Attention-Based Recurrent Neural Networks. IEEE Transactions on Multimedia, 19(9): 2080--2092. DOI: 10.1109/TMM.2017.2722687
  • (2017) A general description generator for human activity images based on deep understanding framework. Neural Computing and Applications, 28(8): 2147--2163. DOI: 10.1007/s00521-015-2171-x


Published In

MM '15: Proceedings of the 23rd ACM international conference on Multimedia
October 2015
1402 pages
ISBN:9781450334594
DOI:10.1145/2733373
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. convolutional neural networks
  2. image description
  3. object detection
  4. recurrent neural networks
  5. region optimization

Qualifiers

  • Short-paper

Funding Sources

  • National Science Foundation of China
  • Lenovo Outstanding Young Scientists Program (LOYS)
  • National Basic Research Program of China (973 Program)
  • National Hi-Tech Development Program (863 Program) of China

Conference

MM '15: ACM Multimedia Conference
October 26--30, 2015
Brisbane, Australia

Acceptance Rates

MM '15 paper acceptance rate: 56 of 252 submissions (22%)
Overall acceptance rate: 2,145 of 8,556 submissions (25%)

