Abstract
Given a textual description of an image, phrase grounding localizes the objects in the image referred to by query phrases in the description. State-of-the-art methods treat phrase grounding as a ranking problem and address it by retrieving a set of proposals according to the query's semantics; they are therefore limited by the performance of independent proposal generation systems and ignore useful cues from the surrounding context in the description. In this paper, we propose a novel multimodal spatial regression with semantic context (MSRC) system which not only predicts the ground-truth location by regressing from proposal bounding boxes, but also refines its predictions by penalizing similarity between different queries drawn from the same sentence. MSRC has two advantages. First, it sidesteps the performance upper bound imposed by independent proposal generation systems by adopting a regression mechanism. Second, MSRC not only encodes the semantics of a query phrase, but also models its relation to context (i.e., other queries from the same sentence) via a context refinement network. Experiments show that the MSRC system achieves significant improvements in accuracy on two popular datasets, Flickr30K Entities and Refer-it Game, with 6.64% and 5.28% increases over the state of the art, respectively.
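The spatial-regression idea the abstract describes — predicting the ground-truth box from a proposal rather than merely ranking proposals — can be sketched with the standard box-offset parameterization used in Fast R-CNN. This is an illustrative assumption about the regression targets; MSRC's exact regression head may differ.

```python
import math

def box_to_xywh(box):
    """Convert (x1, y1, x2, y2) corners to (center_x, center_y, w, h)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0, x2 - x1, y2 - y1)

def regression_targets(proposal, ground_truth):
    """Offsets (tx, ty, tw, th) a regressor learns to predict."""
    px, py, pw, ph = box_to_xywh(proposal)
    gx, gy, gw, gh = box_to_xywh(ground_truth)
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))

def apply_offsets(proposal, t):
    """Refine a proposal with predicted offsets (inverse of the above)."""
    px, py, pw, ph = box_to_xywh(proposal)
    tx, ty, tw, th = t
    gx, gy = px + tx * pw, py + ty * ph
    gw, gh = pw * math.exp(tw), ph * math.exp(th)
    return (gx - gw / 2, gy - gh / 2, gx + gw / 2, gy + gh / 2)

# Regressing from an imperfect proposal recovers the ground-truth box.
proposal = (10.0, 10.0, 50.0, 50.0)
gt = (20.0, 15.0, 60.0, 55.0)
t = regression_targets(proposal, gt)       # (0.25, 0.125, 0.0, 0.0)
refined = apply_offsets(proposal, t)       # ≈ (20.0, 15.0, 60.0, 55.0)
```

Because the predicted box is a continuous function of the proposal, accuracy is no longer capped by whether some proposal already overlaps the target tightly — the upper-bound issue the abstract attributes to ranking-only methods.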
Acknowledgements
This paper is based, in part, on research sponsored by the Air Force Research Laboratory and the Defense Advanced Research Projects Agency under Agreement No. FA8750-16-2-0204. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory and the Defense Advanced Research Projects Agency or the U.S. Government.
Chen, K., Kovvuri, R., Gao, J. et al. MSRC: multimodal spatial regression with semantic context for phrase grounding. Int J Multimed Info Retr 7, 17–28 (2018). https://doi.org/10.1007/s13735-017-0139-6