research-article

Referring Expression Comprehension Via Enhanced Cross-modal Graph Attention Networks

Authors:
Jia Wang
National Yang Ming Chiao Tung University, Hsinchu, Taiwan
ORCID: 0000-0002-0998-251X

Jingcheng Ke
National Tsing Hua University, Hsinchu, Taiwan
ORCID: 0000-0002-2262-6261

Hong-Han Shuai
National Yang Ming Chiao Tung University, Hsinchu, Taiwan
ORCID: 0000-0003-2216-077X

Yung-Hui Li
Hon Hai Research Institute, Hsinchu, Taiwan
ORCID: 0000-0002-0475-3689

Wen-Huang Cheng
National Yang Ming Chiao Tung University, Hsinchu, Taiwan
ORCID: 0000-0002-4662-7875

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 2, Article No. 65, pp. 1–21. https://doi.org/10.1145/3548688

Published: 06 February 2023

Abstract

Referring expression comprehension aims to localize a specific object in an image according to a given language description. Bridging the gap between the different types of information in the visual and textual domains remains challenging. In general, a model must extract the salient features from a given expression and match them to the image. One challenge is that the number of region proposals generated by object detection methods far exceeds the number of entities in the corresponding language description. Notably, candidate regions that are not described by the expression severely degrade referring expression comprehension. To tackle this problem, we first propose novel Enhanced Cross-modal Graph Attention Networks (ECMGANs) that boost the matching between the expression and the entity positions in an image. We then propose an effective strategy named Graph Node Erase (GNE) to help ECMGANs eliminate the effect of irrelevant objects on the target object. Experiments on three public referring expression comprehension datasets show that our ECMGANs framework achieves better performance than other state-of-the-art methods. Moreover, GNE effectively yields higher visual-expression matching accuracy.

Published in
ACM Transactions on Multimedia Computing, Communications, and Applications Volume 19, Issue 2
March 2023
540 pages
ISSN:1551-6857
EISSN:1551-6865
DOI:10.1145/3572860
Editor:
Abdulmotaleb El Saddik
Mohamed Bin Zayed University of Artificial Intelligence, UAE and University of Ottawa, Canada
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 6 February 2023
- Online AM: 15 July 2022
- Accepted: 27 June 2022
- Revised: 18 May 2022
- Received: 4 January 2022
Author Tags
Referring expression comprehension
object detection
Enhanced Cross-modal Graph Attention Networks
Graph Node Erase
Qualifiers
- research-article
- Refereed