DOI: 10.1145/3581783.3613433
short-paper

Video Scene Graph Generation with Spatial-Temporal Knowledge

Published: 27 October 2023

Abstract

Video understanding tasks have been extensively explored in the multimedia community. Among them, video scene graph generation (VidSGG) is especially challenging because it requires identifying the objects in complex scenes and inferring the relationships between them. Existing methods generally aggregate object-level visual information from both spatial and temporal perspectives to learn stronger relationship representations. However, these methods model the spatial-temporal context only implicitly, which may lead to ambiguous predicate predictions when visual relations change frequently. In this work, I propose incorporating spatial-temporal knowledge into relation representation learning to constrain both the spatial prediction space within each image and the sequential variation across temporal frames. To this end, I design a novel spatial-temporal knowledge-embedded transformer (STKET) that incorporates prior spatial-temporal knowledge into the multi-head cross-attention mechanism to learn more representative relationship features. Extensive experiments on Action Genome demonstrate the effectiveness of the proposed STKET.
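
The abstract names the core mechanism (prior knowledge injected through multi-head cross-attention) without detailing it. The following is a minimal NumPy sketch of that general idea only, not the STKET implementation: relation features act as queries and a set of prior-knowledge embeddings act as keys and values. The function name `knowledge_cross_attention`, the tensor shapes, the absence of learned projections, and the residual fusion are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def knowledge_cross_attention(rel_feats, knowledge_emb, num_heads=4):
    """Hypothetical sketch of knowledge-embedded multi-head cross-attention.

    rel_feats:     (N, D) visual relation features, used as queries.
    knowledge_emb: (K, D) prior-knowledge embeddings, used as keys and values.
    Returns the fused (N, D) features and the (H, N, K) attention weights.
    Learned projection matrices are omitted to keep the sketch short.
    """
    n, d = rel_feats.shape
    k, d_k = knowledge_emb.shape
    assert d == d_k, "queries and knowledge embeddings must share dimension D"
    assert d % num_heads == 0, "D must be divisible by the number of heads"
    dh = d // num_heads

    # Split the feature dimension into heads: (H, N, dh) and (H, K, dh).
    q = rel_feats.reshape(n, num_heads, dh).transpose(1, 0, 2)
    kv = knowledge_emb.reshape(k, num_heads, dh).transpose(1, 0, 2)

    # Scaled dot-product attention of each relation over the knowledge entries.
    scores = q @ kv.transpose(0, 2, 1) / np.sqrt(dh)  # (H, N, K)
    weights = softmax(scores, axis=-1)
    attended = weights @ kv  # (H, N, dh)

    # Merge heads and fuse with the visual features (assumed residual fusion).
    attended = attended.transpose(1, 0, 2).reshape(n, d)
    return rel_feats + attended, weights


rng = np.random.default_rng(0)
fused, attn = knowledge_cross_attention(
    rng.normal(size=(5, 8)), rng.normal(size=(3, 8)), num_heads=2
)
print(fused.shape, attn.shape)
```

In this sketch each relation feature attends over every knowledge entry, so the attention weights form a distribution over the prior knowledge for each head; how the actual model builds and embeds its spatial and temporal priors is not specified in the abstract.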


Cited By

  • (2024) Category-Adaptive Label Discovery and Noise Rejection for Multi-Label Recognition With Partial Positive Labels. IEEE Transactions on Multimedia 26, 9591-9602. DOI: 10.1109/TMM.2024.3395901. Online publication date: 1-May-2024

    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. dynamic scene graph generation
    2. video understanding
    3. vision and language

    Qualifiers

    • Short-paper

    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (last 12 months): 138
    • Downloads (last 6 weeks): 16
    Reflects downloads up to 05 Mar 2025
