Abstract
As a bridge between videos and natural language, video-text matching has become a prominent multimedia research topic in recent years. Such cross-modal retrieval is usually achieved by learning a common embedding space in which videos and text captions are directly comparable. The task remains challenging because existing visual representations do not fully exploit semantic correlations within videos, resulting in a mismatch with the semantic concepts contained in the corresponding text descriptions. In this article, we propose a new Guided Graph Attention Learning (GGAL) model that enhances video embedding learning by capturing important region-level semantic concepts within the spatiotemporal space. Our model builds connections between object regions and performs hierarchical graph reasoning on both frame-level and whole video–level region graphs. During this process, global context guides attention learning over the hierarchical graph topology, so that the learned overall video embedding focuses on essential semantic concepts and is better aligned with text captions. Experiments on commonly used benchmarks validate that GGAL outperforms many recent video-text retrieval methods by a clear margin. Since multimedia data in dynamic environments is becoming critically important, we also verify via cross-dataset evaluations that the video-text representations learned by GGAL generalize well to unseen out-of-domain data. To further investigate the interpretability of our model, we visualize the attention weights learned by GGAL. We find that GGAL successfully focuses on key semantic concepts in the video and attends complementarily to contextual regions depending on how the region graphs are built.
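The core mechanism the abstract describes — attention over a region graph, biased by a global context vector so that semantically important regions dominate the aggregated embedding — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `guided_graph_attention`, the dot-product scoring, and the additive context bias are all assumptions made for clarity.

```python
import numpy as np

def softmax(x):
    # numerically stable row-wise softmax
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def guided_graph_attention(H, A, g):
    """One round of context-guided attention over a region graph (illustrative sketch).

    H: (N, d) region-node features (e.g., detected object regions)
    A: (N, N) 0/1 adjacency matrix connecting related regions
    g: (d,)   global context vector that guides attention toward
              semantically important regions
    Returns updated (N, d) region features.
    """
    N, d = H.shape
    # pairwise affinities between regions
    scores = (H @ H.T) / np.sqrt(d)
    # guidance term: how strongly each region matches the global context
    guide = (H @ g) / np.sqrt(d)            # (N,)
    scores = scores + guide[None, :]        # bias attention toward context-relevant regions
    # mask out non-edges, then normalize per node
    scores = np.where(A > 0, scores, -1e9)
    alpha = softmax(scores)                 # (N, N) attention weights
    return alpha @ H                        # aggregate neighbor features
```

In the hierarchical setting described above, one such round would run on each frame-level graph and another on the whole video–level graph, with the resulting node features pooled into the final video embedding.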
Index Terms
- Guided Graph Attention Learning for Video-Text Matching