DOI: 10.1145/2733373.2806338

Rich Image Description Based on Regions

Published: 13 October 2015

Abstract

Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In contrast to previous image description methods that focus on describing the whole image, this paper presents a method for generating rich image descriptions from image regions. First, we detect regions with the R-CNN (regions with convolutional neural network features) framework. We then use an RNN (recurrent neural network) to generate a sentence for each image region. Finally, we propose an optimization method to select the most suitable region. The proposed model generates several sentence descriptions of regions in an image, which together have sufficient representative power for the whole image and contain more detailed information. Compared with a general image-level description, generating more specific and accurate sentences for different regions can better satisfy the personal requirements of different users. Experimental evaluations validate the effectiveness of the proposed method.
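The pipeline in the abstract ends with an optimization step that selects one suitable region from the candidates. The Python sketch below illustrates only that final selection step on toy data; the `select_region` function, the `confidence`/`area_ratio` fields, and the weighting scheme are illustrative assumptions, not the paper's actual objective, which is not reproduced on this page.

```python
# Illustrative sketch of a region-selection step like the one described in
# the abstract. The scoring function is a hypothetical stand-in: it trades
# off detection confidence against relative region area as a rough proxy
# for how representative a region is of the whole image.

def select_region(regions, alpha=0.5):
    """Return the region maximizing a weighted sum of detection
    confidence and area ratio (region area / image area)."""
    def score(r):
        return alpha * r["confidence"] + (1 - alpha) * r["area_ratio"]
    return max(regions, key=score)

# Toy region proposals, e.g. as R-CNN might produce after scoring.
regions = [
    {"name": "dog",     "confidence": 0.92, "area_ratio": 0.40},
    {"name": "frisbee", "confidence": 0.85, "area_ratio": 0.05},
    {"name": "grass",   "confidence": 0.60, "area_ratio": 0.65},
]

best = select_region(regions)
print(best["name"])  # prints: dog
```

With `alpha` closer to 0 the criterion favors large regions (here, "grass"); closer to 1 it favors confident detections. Any real implementation would score regions with a learned objective rather than this hand-set trade-off.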


Cited By

  • (2019) Neighbouring Relationship Exploration Based on Graph Convolutional Network for Object Detection. 2019 IEEE International Conference on Unmanned Systems and Artificial Intelligence (ICUSAI), pages 178--183. DOI: 10.1109/ICUSAI47366.2019.9124840
  • (2019) Attention-gated LSTM for Image Captioning. 2019 IEEE International Conference on Unmanned Systems and Artificial Intelligence (ICUSAI), pages 172--177. DOI: 10.1109/ICUSAI47366.2019.9124779
  • (2017) VideoWhisper: Toward Discriminative Unsupervised Video Feature Learning With Attention-Based Recurrent Neural Networks. IEEE Transactions on Multimedia, 19(9): 2080--2092. DOI: 10.1109/TMM.2017.2722687
  • (2017) A general description generator for human activity images based on deep understanding framework. Neural Computing and Applications, 28(8): 2147--2163. DOI: 10.1007/s00521-015-2171-x


Published In

MM '15: Proceedings of the 23rd ACM international conference on Multimedia
October 2015
1402 pages
ISBN:9781450334594
DOI:10.1145/2733373
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. convolutional neural networks
  2. image description
  3. object detection
  4. recurrent neural networks
  5. region optimization

Qualifiers

  • Short-paper

Funding Sources

  • National Science Foundation of China
  • Lenovo Outstanding Young Scientists Program (LOYS)
  • National Basic Research Program of China (973 Program)
  • National Hi-Tech Development Program (863 Program) of China

Conference

MM '15: ACM Multimedia Conference
October 26--30, 2015
Brisbane, Australia

Acceptance Rates

MM '15 paper acceptance rate: 56 of 252 submissions (22%)
Overall acceptance rate: 2,145 of 8,556 submissions (25%)

