DOI: 10.1145/3463945.3469055

Language-Conditioned Region Proposal and Retrieval Network for Referring Expression Comprehension

Published: 27 August 2021

Abstract

Referring expression comprehension (REC) is a multi-modal task that aims to localize the target region in an image according to a language description. Existing methods fall into two categories: proposal-based methods and proposal-free methods. Proposal-based methods first detect all candidate objects in the image and then retrieve the target among them based on the language description, while proposal-free methods locate the region directly from the language without any region proposals. However, proposal-based methods rely on stand-alone region proposal networks that are not well suited to this task, and proposal-free methods cannot perform the fine-grained vision-language alignment needed for higher precision. To overcome these drawbacks, we propose a language-conditioned region proposal and retrieval network that first detects only those regions related to the language and then retrieves the target region through compositional reasoning over the language. Specifically, the proposed network consists of a language-conditioned region proposal network (LC-RPN) that detects language-related regions, and a language-conditioned region retrieval network (LC-RRN) that performs region retrieval with a full understanding of the language. A pre-training mechanism is further proposed to teach the model language decomposition and vision-language alignment. Experimental results demonstrate that our method achieves leading performance with high inference speed on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks.
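
The abstract describes a two-stage pipeline: language-conditioned proposal generation (LC-RPN) followed by language-conditioned retrieval (LC-RRN). The paper's actual architecture is not reproduced on this page, so the following is only a minimal PyTorch sketch of that two-stage idea under stated assumptions: all module names, dimensions, and fusion choices (sigmoid gating of the visual feature map by a sentence embedding, cosine-similarity retrieval over pooled region features) are illustrative, not the authors' design.

```python
# Hypothetical sketch of the two-stage REC pipeline from the abstract.
# Module names, dimensions, and fusion choices are assumptions for
# illustration only; they do not reproduce the paper's architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LCRPN(nn.Module):
    """Language-conditioned region proposal: scores spatial locations on a
    visual feature map after gating it with a sentence embedding (assumed)."""
    def __init__(self, vis_dim=256, lang_dim=256):
        super().__init__()
        self.lang_proj = nn.Linear(lang_dim, vis_dim)
        self.score_head = nn.Conv2d(vis_dim, 1, kernel_size=1)  # language-relatedness score per location
        self.box_head = nn.Conv2d(vis_dim, 4, kernel_size=1)    # box offsets per location

    def forward(self, feat_map, sent_emb):
        # feat_map: (B, C, H, W); sent_emb: (B, lang_dim)
        gate = torch.sigmoid(self.lang_proj(sent_emb))[:, :, None, None]  # (B, C, 1, 1)
        fused = feat_map * gate                  # language-gated visual features
        return self.score_head(fused), self.box_head(fused)

class LCRRN(nn.Module):
    """Language-conditioned region retrieval: ranks pooled proposal features
    by cosine similarity to the expression embedding in a joint space."""
    def __init__(self, region_dim=256, lang_dim=256, joint_dim=256):
        super().__init__()
        self.vis_proj = nn.Linear(region_dim, joint_dim)
        self.lang_proj = nn.Linear(lang_dim, joint_dim)

    def forward(self, region_feats, sent_emb):
        # region_feats: (B, K, region_dim) for K proposals; sent_emb: (B, lang_dim)
        v = F.normalize(self.vis_proj(region_feats), dim=-1)  # (B, K, D)
        l = F.normalize(self.lang_proj(sent_emb), dim=-1)     # (B, D)
        sim = torch.einsum('bkd,bd->bk', v, l)                # cosine similarities
        return sim.argmax(dim=-1), sim                        # index of predicted target region

# Toy run with random tensors; shapes are illustrative only.
if __name__ == "__main__":
    B, C, H, W, K = 2, 256, 32, 32, 8
    scores, boxes = LCRPN()(torch.randn(B, C, H, W), torch.randn(B, 256))
    idx, sim = LCRRN()(torch.randn(B, K, 256), torch.randn(B, 256))
    print(scores.shape, boxes.shape, idx.shape)  # (2, 1, 32, 32) (2, 4, 32, 32) (2,)
```

In a real system the top-scoring LC-RPN locations would be decoded into boxes and RoI-pooled before LC-RRN ranking; the paper's compositional reasoning over the decomposed expression and its pre-training mechanism are not modeled in this sketch.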


Cited By

  • InterREC: An Interpretable Method for Referring Expression Comprehension. IEEE Transactions on Multimedia, vol. 25 (2023), 9330--9342. DOI: 10.1109/TMM.2023.3251111. Online publication date: 1 March 2023.


Published In

MMPT '21: Proceedings of the 2021 Workshop on Multi-Modal Pre-Training for Multimedia Understanding
August 2021, 60 pages
ISBN: 9781450385305
DOI: 10.1145/3463945

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. multi-modal pre-training
  2. multi-modal retrieval
  3. referring expression comprehension
  4. region proposal
