Abstract
Referring expression comprehension aims to localize a specific object in an image according to a given language description. Bridging the gap between the different types of information in the visual and textual domains remains challenging. In general, a model must extract the salient features of a given expression and match them to the image. One challenge is that the number of region proposals generated by object detection methods far exceeds the number of entities in the corresponding language description. Notably, candidate regions that are not described by the expression severely degrade referring expression comprehension. To tackle this problem, we first propose novel Enhanced Cross-modal Graph Attention Networks (ECMGANs) that boost the matching between the expression and the entity positions in an image. Then, we propose an effective strategy named Graph Node Erase (GNE) to help ECMGANs eliminate the effect of irrelevant objects on the target object. Experiments on three public referring expression comprehension datasets show that our ECMGANs framework achieves better performance than other state-of-the-art methods. Moreover, GNE effectively improves the accuracy of visual-expression matching.
Index Terms
- Referring Expression Comprehension Via Enhanced Cross-modal Graph Attention Networks