DOI: 10.1145/3444685.3446270

Multi-level expression guided attention network for referring expression comprehension

Published: 03 May 2021

Abstract

Referring expression comprehension is the task of identifying the object or region in a given image that a natural language expression refers to. In this task, it is essential to understand the expression from multiple aspects and adapt it to region representations in order to generate discriminative information. Unfortunately, previous approaches usually focus on the important words or phrases in the expression using self-attention mechanisms, which may cause them to fail to distinguish the target region from others, especially similar regions. To address this problem, we propose a novel model, termed the Multi-level Expression Guided Attention network (MEGA-Net). It contains a multi-level visual attention scheme guided by expression representations at different levels, i.e., the sentence, word, and phrase levels, which generates discriminative region features and helps to locate the related regions accurately. In addition, to distinguish similar regions, we design a two-stage structure: in the first stage, we select the top-K candidate regions according to their matching scores; in the second stage, we apply an object-comparison attention mechanism to learn the differences between the candidates and match the target region. We evaluate the proposed approach on three popular benchmark datasets, and the experimental results demonstrate that our model performs favorably against state-of-the-art methods.
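
To make the pipeline described above concrete, the following PyTorch sketch renders its two ideas: (i) visual attention guided separately by sentence-, word-, and phrase-level expression cues, and (ii) a second stage that selects the top-K candidate regions and applies object-comparison attention among them. It is a minimal illustration, not the authors' implementation: the sigmoid gating, the multi-head comparison layer, the feature dimensions, and the pooling of word- and phrase-level cues into single vectors are all our assumptions.

import torch
import torch.nn as nn


class ExpressionGuidedAttention(nn.Module):
    """Gates region features under one level of the expression
    (sentence, word, or phrase), yielding expression-adapted regions."""

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)  # projects the expression cue
        self.k = nn.Linear(dim, dim)  # projects the region features

    def forward(self, expr, regions):
        # expr: (B, D) pooled expression cue; regions: (B, R, D)
        logits = (self.q(expr).unsqueeze(1) * self.k(regions)).sum(-1)  # (B, R)
        gate = torch.sigmoid(logits).unsqueeze(-1)                      # (B, R, 1)
        return gate * regions                                           # (B, R, D)


class TwoStageGrounder(nn.Module):
    """Stage 1 scores all regions with multi-level guided attention;
    stage 2 re-scores the top-K candidates after letting them attend
    to one another (object-comparison attention)."""

    def __init__(self, dim, top_k=3):
        super().__init__()
        self.top_k = top_k
        self.levels = nn.ModuleList(
            ExpressionGuidedAttention(dim) for _ in range(3))
        self.match = nn.Linear(3 * dim, 1)    # stage-1 matching score
        self.compare = nn.MultiheadAttention(dim, num_heads=4,
                                             batch_first=True)
        self.rescore = nn.Linear(dim, 1)      # stage-2 score

    def forward(self, sent, word, phrase, regions):
        # Stage 1: one guided-attention pass per expression level, fused.
        guided = torch.cat(
            [attn(e, regions) for attn, e in
             zip(self.levels, (sent, word, phrase))], dim=-1)  # (B, R, 3D)
        scores = self.match(guided).squeeze(-1)                # (B, R)

        # Keep the top-K candidates by matching score.
        _, idx = scores.topk(self.top_k, dim=-1)               # (B, K)
        cand = regions.gather(
            1, idx.unsqueeze(-1).expand(-1, -1, regions.size(-1)))

        # Stage 2: candidates attend to each other, so each feature
        # encodes how it differs from its close competitors.
        compared, _ = self.compare(cand, cand, cand)           # (B, K, D)
        best = self.rescore(compared).squeeze(-1).argmax(-1)   # (B,)
        return idx.gather(1, best.unsqueeze(-1))               # (B, 1) region index


# Toy usage: 512-d features, 20 region proposals per image, top-3 comparison.
model = TwoStageGrounder(dim=512, top_k=3)
sent = word = phrase = torch.randn(2, 512)   # pooled cues (an assumption)
regions = torch.randn(2, 20, 512)
print(model(sent, word, phrase, regions).shape)  # torch.Size([2, 1])

The design point the abstract argues for is visible in stage 2: a candidate's final score depends on the other candidates it is compared against, which is what allows the model to separate the target from visually similar regions.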

References

[1]
Yi Bin, Yang Yang, Fumin Shen, Ning Xie, Heng Tao Shen, and Xuelong Li. 2018. Describing video with attention-based bidirectional LSTM. IEEE Transactions on Cybernetics 49, 7 (2018), 2631--2641.
[2]
Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259 (2014).
[3]
Chaorui Deng, Qi Wu, Qingyao Wu, Fuyuan Hu, Fan Lyu, and Mingkui Tan. 2018. Visual grounding via accumulated attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7746--7755.
[4]
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput. 9, 8 (1997), 1735--1780.
[5]
Ronghang Hu, Marcus Rohrbach, Jacob Andreas, Trevor Darrell, and Kate Saenko. 2017. Modeling relationships in referential expressions with compositional modular networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1115--1124.
[6]
Ronghang Hu, Huazhe Xu, Marcus Rohrbach, Jiashi Feng, Kate Saenko, and Trevor Darrell. 2016. Natural language object retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4555--4564.
[7]
He K, Zhang X, Ren S, and Sun J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770--778.
[8]
Simonyan K and Zisserman A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[9]
Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[10]
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, and Pietro Perona. 2014. Microsoft coco: Common objects in context. In European Conference on Computer Vision. 740--755.
[11]
Daqing Liu, Hanwang Zhang, Feng Wu, and Zheng-Jun Zha. 2019. Learning to assemble neural module tree networks for visual grounding. In Proceedings of the IEEE International Conference on Computer Vision. 4673--4682.
[12]
Daqing Liu, Hanwang Zhang, Zheng-Jun Zha, and Fanglin Wang. 2019. Referring Expression Grounding by Marginalizing Scene Graph Likelihood. arXiv preprint arXiv:1906.03561 (2019).
[13]
Ruotian Luo and Gregory Shakhnarovich. 2017. Comprehension-guided referring expressions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7102--7111.
[14]
Yadan Luo, Yang Yang, Fumin Shen, Zi Huang, Pan Zhou, and Heng Tao Shen. 2018. Robust discrete code modeling for supervised hashing. Pattern Recognition 75 (2018), 128--135.
[15]
Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. 2016. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition. 11--20.
[16]
Varun K Nagaraja, Vlad I Morariu, and Larry S Davis. 2016. Modeling context between objects for referring expression understanding. In European Conference on Computer Vision. Springer, 792--807.
[17]
Liang Peng, Yang Yang, Yi Bin, Ning Xie, Fumin Shen, Yanli Ji, and Xing Xu. 2019. Word-to-region attention network for visual question answering. Multimedia Tools and Application 78, 3 (2019), 3843--3858.
[18]
Liang Peng, Yang Yang, Zheng Wang, Zi Huang, and Heng Tao Shen. 2020. MRA-Net: Improving VQA via Multi-modal Relation Attention Network. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020).
[19]
Liang Peng, Yang Yang, Zheng Wang, Xiao Wu, and Zi Huang. 2019. CRA-Net: Composed Relation Attention Network for Visual Question Answering. In Proceedings of the ACM International Conference on Multimedia. 1202--1210.
[20]
Liang Peng, Yang Yang, Xiaopeng Zhang, Yanli Ji, Huimin Lu, and Heng Tao Shen. 2020. Answer Again: Imporving VQA with Cascaded-Answering Model. IEEE Transactions on Knowledge and Data Engineering (2020), 1--12.
[21]
Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing. 1532--1543.
[22]
Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition. 779--788.
[23]
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems. 91--99.
[24]
Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, and Bernt Schiele. 2016. Grounding of textual phrases in images by reconstruction. In European Conference on Computer Vision. Springer, 817--834.
[25]
Sebastian Schuster, Ranjay Krishna, Angel Chang, Li Fei-Fei, and Christopher D Manning. 2015. Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In Proceedings of the fourth workshop on vision and language. 70--80.
[26]
H. T. Shen, L. Liu, Y. Yang, X. Xu, Z. Huang, F. Shen, and R. Hong. 2020. Exploiting Subspace Relation in Semantic Labels for Cross-modal Hashing. IEEE Transactions on Knowledge and Data Engineering (2020).
[27]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998--6008.
[28]
Bokun Wang, Yang Yang, Xing Xu, Alan Hanjalic, and Heng Tao Shen. 2017. Adversarial Cross-Modal Retrieval. In Proceedings of the ACM International Conference on Multimedia. 154--162.
[29]
Liwei Wang, Yin Li, and Svetlana Lazebnik. 2016. Learning deep structure-preserving image-text embeddings. In Proceedings of the IEEE conference on computer vision and pattern recognition. 5005--5013.
[30]
Peng Wang, Qi Wu, Jiewei Cao, Chunhua Shen, Lianli Gao, and Anton van den Hengel. 2019. Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1960--1968.
[31]
Shuai Wang, Fan Lyu, Wei Feng, and Song Wang. 2020. MUTATT: Visual-Textual Mutual Guidance for Referring Expression Comprehension. arXiv preprint arXiv:2003.08027 (2020).
[32]
Zheng Wang, Jie Zhou, Jing Ma, Jingjing Li, Jiangbo Ai, and Yang Yang. 2020. Discovering attractive segments in the user-generated video streams. Information Processing & Management 57, 1 (2020), 102130.
[33]
Jiwei Wei, Xing Xu, Yang Yang, Yanli Ji, Zheng Wang, and Heng Tao Shen. 2020. Universal Weighting Metric Learning for Cross-Modal Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13005--13014.
[34]
Jiwei Wei, Yang Yang, Jingjing Li, Lei Zhu, Lin Zuo, and Heng Tao Shen. 2019. Residual Graph Convolutional Networks for Zero-Shot Learning. In Proceedings of the ACM Multimedia Asia. 1--6.
[35]
X. Xu, F. Shen, Y. Yang, H. T. Shen, and X. Li. 2017. Learning Discriminative Binary Codes for Large-scale Cross-modal Retrieval. IEEE Transactions on Image Processing 26, 5 (2017), 2494--2507.
[36]
X. Xu, T. Wang, Y. Yang, L. Zuo, F. Shen, and H. T. Shen. 2020. Cross-Modal Attention With Semantic Consistence for Image-Text Matching. IEEE Transactions on Neural Networks and Learning Systems (2020), 1--14.
[37]
Sibei Yang, Guanbin Li, and Yizhou Yu. 2019. Dynamic graph attention for referring expression comprehension. In Proceedings of the IEEE International Conference on Computer Vision. 4644--4653.
[38]
Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. 2016. Image captioning with semantic attention. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4651--4659.
[39]
Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L Berg. 2018. Mattnet: Modular attention network for referring expression comprehension. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1307--1315.
[40]
Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. 2016. Modeling context in referring expressions. In European Conference on Computer Vision. Springer, 69--85.
[41]
Licheng Yu, Hao Tan, Mohit Bansal, and Tamara L Berg. 2017. A joint speaker-listener-reinforcer model for referring expressions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7282--7290.
[42]
Mingxing Zhang, Yang Yang, Hanwang Zhang, Yanli Ji, Heng Tao Shen, and Tat-Seng Chua. 2019. More is Better: Precise and Detailed Image Captioning using Online Positive Recall and Missing Concepts Mining. IEEE Transactions on Image Processing 28, 1 (2019), 32--44.
[43]
Bohan Zhuang, Qi Wu, Chunhua Shen, Ian Reid, and Anton van den Hengel. 2018. Parallel attention: A unified framework for visual object discovery through dialogs and queries. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4252--4261.

Cited By

  • (2024) A Regionally Indicated Visual Grounding Network for Remote Sensing Images. IEEE Transactions on Geoscience and Remote Sensing 62 (2024), 1-11. DOI: 10.1109/TGRS.2024.3490847

Published In

MMAsia '20: Proceedings of the 2nd ACM International Conference on Multimedia in Asia
March 2021, 512 pages
ISBN: 9781450383080
DOI: 10.1145/3444685

Publisher

Association for Computing Machinery, New York, NY, United States

      Author Tags

      1. attention mechanism
      2. multi-level
      3. object comparison
      4. referring expression comprehension

Qualifiers

• Research-article

Conference

MMAsia '20: ACM Multimedia Asia
March 7, 2021
Virtual Event, Singapore

Acceptance Rates

Overall Acceptance Rate: 59 of 204 submissions, 29%
