Abstract
As a bridge between videos and natural language, video-text matching has become a prominent multimedia research topic in recent years. Such cross-modal retrieval is usually achieved by learning a common embedding space in which videos and text captions are directly comparable. The task remains challenging because existing visual representations do not fully exploit semantic correlations within videos, resulting in a mismatch with the semantic concepts contained in the corresponding text descriptions. In this article, we propose a new Guided Graph Attention Learning (GGAL) model that enhances video embedding learning by capturing important region-level semantic concepts within the spatiotemporal space. Our model builds connections between object regions and performs hierarchical graph reasoning on both frame-level and whole video–level region graphs. During this process, global context guides attention learning over the hierarchical graph topology, so that the learned overall video embedding focuses on essential semantic concepts and is better aligned with text captions. Experiments on commonly used benchmarks validate that GGAL outperforms many recent video-text retrieval methods by a clear margin. Since multimedia data in dynamic environments is becoming critically important, we also verify via cross-dataset evaluations that the video-text representations learned by GGAL generalize well to unseen out-of-domain data. To further investigate the interpretability of our model, we visualize the attention weights learned by GGAL. We find that GGAL successfully focuses on key semantic concepts in the video and attends complementarily to contextual regions depending on how the region graphs are built.
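The core mechanism the abstract describes — attention over a region graph, biased by a global context vector so that semantically important regions dominate the aggregated embedding — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `guided_graph_attention`, the dot-product scoring, and the additive context bias are all assumptions made for clarity.

```python
import numpy as np

def softmax(x):
    # numerically stable row-wise softmax
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def guided_graph_attention(H, A, g):
    """One round of context-guided attention over a region graph (illustrative sketch).

    H: (N, d) region-node features (e.g., detected object regions)
    A: (N, N) 0/1 adjacency matrix connecting related regions
    g: (d,)   global context vector that guides attention toward
              semantically important regions
    Returns updated (N, d) region features.
    """
    N, d = H.shape
    # pairwise affinities between regions
    scores = (H @ H.T) / np.sqrt(d)
    # guidance term: how strongly each region matches the global context
    guide = (H @ g) / np.sqrt(d)            # (N,)
    scores = scores + guide[None, :]        # bias attention toward context-relevant regions
    # mask out non-edges, then normalize per node
    scores = np.where(A > 0, scores, -1e9)
    alpha = softmax(scores)                 # (N, N) attention weights
    return alpha @ H                        # aggregate neighbor features
```

In the hierarchical setting described above, one such round would run on each frame-level graph and another on the whole video–level graph, with the resulting node features pooled into the final video embedding.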
Index Terms
- Guided Graph Attention Learning for Video-Text Matching