
Guided Graph Attention Learning for Video-Text Matching

Published: 06 January 2023

Abstract

As a bridge between videos and natural language, video-text matching has been a popular multimedia research topic in recent years. Such cross-modal retrieval is usually achieved by learning a common embedding space in which videos and text captions are directly comparable. The task remains challenging because existing visual representations do not fully exploit semantic correlations within videos, resulting in a mismatch with the semantic concepts contained in the corresponding text descriptions. In this article, we propose a new Guided Graph Attention Learning (GGAL) model that enhances video embedding learning by capturing important region-level semantic concepts within the spatiotemporal space. Our model builds connections between object regions and performs hierarchical graph reasoning on both frame-level and whole-video-level region graphs. During this process, global context guides attention learning over this hierarchical graph topology, so that the learned video embedding focuses on essential semantic concepts and aligns better with text captions. Experiments on commonly used benchmarks show that GGAL outperforms many recent video-text retrieval methods by a clear margin. Because multimedia data in dynamic environments is becoming critically important, we also show through cross-dataset evaluations that the video-text representations learned by GGAL generalize well to unseen out-of-domain data. To further investigate the interpretability of our model, we visualize the attention weights learned by GGAL and find that it successfully focuses on key semantic concepts in the video while attending to complementary context depending on how the region graphs are built.
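
The abstract describes the core mechanism only at a high level: graph reasoning over object regions whose attention is guided by a global context signal. The snippet below is a minimal, hypothetical sketch of that idea in PyTorch, not the authors' implementation; all names (GuidedGraphAttention, region_feats, global_ctx) and the choices of a fully connected region graph and mean-pooled context are assumptions made purely for illustration.

```python
# Hypothetical sketch of context-guided graph attention over region features.
# Not the GGAL code; a simplified illustration of "global context guides
# attention learning over a region graph".
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedGraphAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)   # projects the global context (the guide)
        self.key = nn.Linear(dim, dim)     # projects region nodes for guidance scoring
        self.value = nn.Linear(dim, dim)   # projects region nodes for message passing
        self.edge = nn.Linear(dim, dim)    # pairwise affinity projection (graph edges)

    def forward(self, region_feats, global_ctx):
        # region_feats: (N, dim) region nodes; global_ctx: (dim,), e.g. a pooled frame feature
        k = self.key(region_feats)                     # (N, dim)
        q = self.query(global_ctx).unsqueeze(0)        # (1, dim)
        # Guided attention: score each region against the global context
        guide = F.softmax((k @ q.t()).squeeze(1) / k.shape[-1] ** 0.5, dim=0)      # (N,)
        # Graph reasoning: pairwise affinities on a fully connected region graph
        affinity = F.softmax(region_feats @ self.edge(region_feats).t(), dim=-1)   # (N, N)
        messages = affinity @ self.value(region_feats)                             # (N, dim)
        # Aggregate propagated node features, weighted by the context-guided attention
        return (guide.unsqueeze(1) * messages).sum(dim=0)                          # (dim,)

# Usage with random tensors standing in for detector outputs
if __name__ == "__main__":
    regions = torch.randn(36, 512)     # e.g. 36 region features from one frame
    context = regions.mean(dim=0)      # assumed global context: simple mean pooling
    video_vec = GuidedGraphAttention(512)(regions, context)
    print(video_vec.shape)             # torch.Size([512])
```

In the paper this kind of reasoning is applied hierarchically, on frame-level and whole-video-level region graphs; the sketch covers only a single graph to keep the guidance-plus-propagation pattern visible.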



        • Published in

          ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 2s
          June 2022, 383 pages
          ISSN: 1551-6857
          EISSN: 1551-6865
          DOI: 10.1145/3561949
          • Editor: Abdulmotaleb El Saddik


          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 6 January 2023
          • Online AM: 9 September 2022
          • Accepted: 6 May 2022
          • Revised: 21 March 2022
          • Received: 16 November 2021


          Qualifiers

          • research-article
          • Refereed
