
Referring Expression Comprehension Via Enhanced Cross-modal Graph Attention Networks

Published: 06 February 2023

Abstract

Referring expression comprehension aims to localize a specific object in an image according to a given language description. Bridging the gap between the heterogeneous information in the visual and textual domains remains challenging: the task generally requires extracting the salient features of a given expression and matching them to the features of the image. One particular challenge is that the number of region proposals generated by object detection methods far exceeds the number of entities mentioned in the corresponding language description, and candidate regions that are not described by the expression severely hinder comprehension. To tackle this problem, we first propose a novel Enhanced Cross-modal Graph Attention Networks (ECMGANs) framework that strengthens the matching between the expression and the positions of entities in the image. We then propose an effective strategy named Graph Node Erase (GNE) to assist ECMGANs in eliminating the influence of irrelevant objects on the target object. Experiments on three public referring expression comprehension datasets show clearly that our ECMGANs framework outperforms other state-of-the-art methods, and GNE effectively yields higher accuracy in visual-expression matching.
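To make the idea concrete, the sketch below shows one plausible way a cross-modal graph attention layer with node erasing could look in PyTorch. It is a minimal illustration under assumed feature dimensions, module names, and a fixed erase ratio; it is not the authors' implementation, and the proposal graph (adjacency matrix) is simply taken as an input.

```python
# Illustrative sketch only: a minimal cross-modal graph attention layer with
# node erasing, loosely following the ideas described in the abstract.
# Class names, dimensions, and the erase ratio are assumptions, not the
# authors' implementation.
import torch
import torch.nn as nn


class CrossModalGraphAttention(nn.Module):
    """Scores region-proposal nodes against an expression embedding and
    erases (masks out) nodes that are weakly related to the expression."""

    def __init__(self, visual_dim: int = 2048, text_dim: int = 512,
                 hidden_dim: int = 512, erase_ratio: float = 0.3):
        super().__init__()
        self.vis_proj = nn.Linear(visual_dim, hidden_dim)   # project region features
        self.txt_proj = nn.Linear(text_dim, hidden_dim)     # project expression feature
        self.edge_proj = nn.Linear(hidden_dim, hidden_dim)  # message passing over the graph
        self.erase_ratio = erase_ratio                      # fraction of nodes to erase

    def forward(self, region_feats: torch.Tensor, expr_feat: torch.Tensor,
                adjacency: torch.Tensor) -> torch.Tensor:
        # region_feats: (N, visual_dim) proposals; expr_feat: (text_dim,);
        # adjacency: (N, N) graph over proposals (e.g., spatial neighbours).
        v = self.vis_proj(region_feats)                      # (N, H)
        t = self.txt_proj(expr_feat)                         # (H,)

        # Cross-modal attention: how well each node matches the expression.
        scores = (v @ t) / v.size(-1) ** 0.5                 # (N,)
        attn = torch.softmax(scores, dim=0)

        # Graph Node Erase (assumed variant): zero out the lowest-scoring nodes
        # so irrelevant proposals cannot influence the target object.
        num_erase = int(self.erase_ratio * v.size(0))
        if num_erase > 0:
            erase_idx = torch.topk(attn, num_erase, largest=False).indices
            mask = torch.ones_like(attn)
            mask[erase_idx] = 0.0
            attn = attn * mask

        # Propagate expression-weighted node features over the proposal graph.
        weighted = v * attn.unsqueeze(-1)                    # (N, H)
        context = adjacency @ self.edge_proj(weighted)       # (N, H)
        return v + context                                   # updated node features


if __name__ == "__main__":
    layer = CrossModalGraphAttention()
    regions = torch.randn(10, 2048)        # 10 region proposals
    expression = torch.randn(512)          # pooled expression embedding
    adj = torch.eye(10)                    # trivial graph for the demo
    print(layer(regions, expression, adj).shape)   # torch.Size([10, 512])
```

In the actual framework the erasing would presumably be driven by learned attention rather than a fixed ratio; the sketch only conveys the overall flow of expression-guided node weighting and pruning.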



• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 2 (March 2023), 540 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/3572860
  Editor: Abdulmotaleb El Saddik


          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 6 February 2023
          • Online AM: 15 July 2022
          • Accepted: 27 June 2022
          • Revised: 18 May 2022
          • Received: 4 January 2022
Published in TOMM Volume 19, Issue 2

          Qualifiers

          • research-article
          • Refereed
