DOI: 10.1145/3581783.3612019
Research Article

Weakly-supervised Video Scene Graph Generation via Unbiased Cross-modal Learning

Published: 27 October 2023

Abstract

Video Scene Graph Generation (VidSGG), which aims to detect the relations between objects in a continuous spatio-temporal environment, has shown great potential for video understanding. Almost all prevailing VidSGG approaches are fully-supervised and therefore require expensive manual annotations. We thus introduce a novel and challenging task named Weakly-supervised Video Scene Graph Generation (WS-VidSGG), in which a model is trained with only unlocalized scene graphs as supervision. Due to the imbalanced data distribution and the lack of fine-grained annotations, models trained in this setting are prone to bias. To address the WS-VidSGG task, we propose an Unbiased Cross-Modal Learning (UCML) framework. Specifically, a cross-modal alignment module is first designed to allocate pseudo labels to unlabeled visual objects. We then extract unbiased knowledge from dataset statistics and utilize prompts to help the model comprehend fine-grained semantic concepts. The features learned from the prompts and the unbiased knowledge reinforce each other, yielding discriminative textual representations. To better explore the relations between visual entities, we design a knowledge-guided attention graph that captures cross-modal relations. Finally, the learned textual and visual features are integrated into a unified framework for relation prediction. Extensive ablation studies verify the effectiveness of our framework, and comparison with state-of-the-art fully-supervised methods shows that it achieves comparable performance. Code is available at https://github.com/ZiyueWu59/UCML.
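
To make the cross-modal alignment step concrete, the following minimal PyTorch sketch shows one way pseudo labels could be allocated: each detected object proposal is matched to the closest entity category named in the video-level (unlocalized) scene graph via cosine similarity in a shared embedding space. This is a sketch under our own assumptions, not the paper's implementation: the function name assign_pseudo_labels, the sim_threshold parameter, and the premise that both modalities are already projected into a joint space are illustrative choices.

    import torch
    import torch.nn.functional as F

    def assign_pseudo_labels(visual_feats, text_embeds, sim_threshold=0.3):
        # visual_feats: (N, D) features of N object proposals, assumed to be
        #               already projected into the joint vision-language space.
        # text_embeds:  (C, D) embeddings of the C entity categories appearing
        #               in the video-level unlocalized scene graph.
        # Returns an (N,) tensor of pseudo-label indices; proposals whose best
        # match falls below sim_threshold are marked -1 (left unlabeled).
        v = F.normalize(visual_feats, dim=-1)   # unit-normalize visual features
        t = F.normalize(text_embeds, dim=-1)    # unit-normalize text embeddings
        sim = v @ t.t()                         # (N, C) cosine similarities
        best_sim, best_idx = sim.max(dim=-1)    # best category per proposal
        return torch.where(best_sim >= sim_threshold,
                           best_idx,
                           torch.full_like(best_idx, -1))

    # Toy usage: 4 proposals, 3 entity categories from the unlocalized graph.
    proposals = torch.randn(4, 256)
    categories = torch.randn(3, 256)
    print(assign_pseudo_labels(proposals, categories))

Thresholding matters in this setting because weak supervision only guarantees that a category appears somewhere in the video; low-confidence matches are better left unlabeled than propagated as noisy training targets.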

Cited By

  • Open-Vocabulary Video Scene Graph Generation via Union-aware Semantic Alignment. In Proceedings of the 32nd ACM International Conference on Multimedia (2024), 8566-8575. https://doi.org/10.1145/3664647.3681061
  • Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal Grounding. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 10 (2024), 1-22. https://doi.org/10.1145/3663368

    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. feature learning
    2. unbiased knowledge
    3. video scene graph generation
    4. weakly-supervised

    Funding Sources

    • Open Research Projects of Zhejiang Lab under Grant
    • Beijing Natural Science Foundation under Grant
    • National Key Research and Development Plan of China under Grant
    • National Natural Science Foundation of China under Grants

    Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
