DOI: 10.1145/3581783.3612019
Research Article

Weakly-supervised Video Scene Graph Generation via Unbiased Cross-modal Learning

Published: 27 October 2023

Abstract

Video Scene Graph Generation (VidSGG), which aims to detect the relations between objects in a continuous spatio-temporal environment, has shown great potential for video understanding. Almost all prevailing VidSGG approaches are fully-supervised and therefore require expensive manual annotations. We thus introduce a novel and challenging task named Weakly-supervised Video Scene Graph Generation (WS-VidSGG), in which a model is trained with only unlocalized scene graphs as supervision. Due to the imbalanced data distribution and the lack of fine-grained annotations, models trained in this setting are prone to bias. To address the WS-VidSGG task, we propose an Unbiased Cross-Modal Learning (UCML) framework. Specifically, a cross-modal alignment module is first designed to allocate pseudo labels to unlabeled visual objects. We then extract unbiased knowledge from dataset statistics and utilize prompts to help the model comprehend fine-grained semantic concepts. The features learned from the prompts and the unbiased knowledge reinforce each other, yielding discriminative textual representations. To better explore the relations between visual entities, we design a knowledge-guided attention graph that captures cross-modal relations. Finally, the learned textual and visual features are integrated into a unified framework for relation prediction. Extensive ablation studies verify the effectiveness of our framework, and comparison with state-of-the-art fully-supervised methods shows that it achieves comparable performance. Code is available at https://github.com/ZiyueWu59/UCML.
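
To make the cross-modal alignment step concrete, the following minimal PyTorch sketch shows one way pseudo labels could be allocated: each detected object proposal is matched to the closest entity category named in the video-level (unlocalized) scene graph via cosine similarity in a shared embedding space. This is a sketch under our own assumptions, not the paper's implementation: the function name assign_pseudo_labels, the sim_threshold parameter, and the premise that both modalities are already projected into a joint space are illustrative choices.

    import torch
    import torch.nn.functional as F

    def assign_pseudo_labels(visual_feats, text_embeds, sim_threshold=0.3):
        # visual_feats: (N, D) features of N object proposals, assumed to be
        #               already projected into the joint vision-language space.
        # text_embeds:  (C, D) embeddings of the C entity categories appearing
        #               in the video-level unlocalized scene graph.
        # Returns an (N,) tensor of pseudo-label indices; proposals whose best
        # match falls below sim_threshold are marked -1 (left unlabeled).
        v = F.normalize(visual_feats, dim=-1)   # unit-normalize visual features
        t = F.normalize(text_embeds, dim=-1)    # unit-normalize text embeddings
        sim = v @ t.t()                         # (N, C) cosine similarities
        best_sim, best_idx = sim.max(dim=-1)    # best category per proposal
        return torch.where(best_sim >= sim_threshold,
                           best_idx,
                           torch.full_like(best_idx, -1))

    # Toy usage: 4 proposals, 3 entity categories from the unlocalized graph.
    proposals = torch.randn(4, 256)
    categories = torch.randn(3, 256)
    print(assign_pseudo_labels(proposals, categories))

Thresholding matters in this setting because weak supervision only guarantees that a category appears somewhere in the video; low-confidence matches are better left unlabeled than propagated as noisy training targets.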

Cited By

  • Open-Vocabulary Video Scene Graph Generation via Union-aware Semantic Alignment. In Proceedings of the 32nd ACM International Conference on Multimedia (2024), 8566-8575. https://doi.org/10.1145/3664647.3681061
  • Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal Grounding. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 10 (2024), 1-22. https://doi.org/10.1145/3663368

    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. feature learning
    2. unbiased knowledge
    3. video scene graph generation
    4. weakly-supervised

    Funding Sources

    • Open Research Projects of Zhejiang Lab under Grant
    • Beijing Natural Science Foundation under Grant
    • National Key Research and Development Plan of China under Grant
    • National Natural Science Foundation of China under Grants

    Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
