short-paper

Video Scene Graph Generation with Spatial-Temporal Knowledge

Author:

Tao Pu

MM '23: Proceedings of the 31st ACM International Conference on Multimedia

Pages 9340 - 9344

https://doi.org/10.1145/3581783.3613433

Published: 27 October 2023

Abstract

Various video understanding tasks have been extensively explored in the multimedia community. Among them, video scene graph generation (VidSGG) is particularly challenging because it requires identifying objects in comprehensive scenes and deducing their relationships. Existing methods for this task generally aggregate object-level visual information from both spatial and temporal perspectives to learn stronger relationship representations. However, these leading techniques model the spatial-temporal context only implicitly, which may lead to ambiguous predicate predictions when visual relations vary frequently. In this work, I propose incorporating spatial-temporal knowledge into relation representation learning to effectively constrain the spatial prediction space within each image and the sequential variation across temporal frames. To this end, I design a novel spatial-temporal knowledge-embedded transformer (STKET) that incorporates prior spatial-temporal knowledge into the multi-head cross-attention mechanism to learn more representative relationship representations. Extensive experiments conducted on Action Genome demonstrate the effectiveness of the proposed STKET.
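As a rough illustration of the mechanism named in the abstract, the sketch below shows a generic multi-head cross-attention step in which relation features act as queries and a set of prior-knowledge embeddings act as keys and values. It is not the STKET implementation: the abstract does not specify the layer, so the function names, the absence of learned projections, the reuse of keys as values, and the residual connection are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, knowledge, num_heads):
    """Multi-head cross-attention without learned projections (illustrative).

    queries:   relation feature vectors, one per subject-object pair
    knowledge: prior-knowledge embedding vectors (hypothetical stand-ins)
    Each vector is split into `num_heads` equal chunks. Every head attends
    from its query chunk to the matching knowledge chunks, and the head
    outputs are concatenated.
    """
    dim = len(queries[0])
    assert dim % num_heads == 0, "feature size must divide evenly into heads"
    head_dim = dim // num_heads
    outputs = []
    for q in queries:
        fused = []
        for h in range(num_heads):
            lo, hi = h * head_dim, (h + 1) * head_dim
            q_h = q[lo:hi]
            # Scaled dot-product score between the query and each knowledge entry.
            scores = [
                sum(a * b for a, b in zip(q_h, k[lo:hi])) / math.sqrt(head_dim)
                for k in knowledge
            ]
            weights = softmax(scores)
            # Weighted sum of knowledge chunks (keys double as values here).
            fused.extend(
                sum(w * k[lo + i] for w, k in zip(weights, knowledge))
                for i in range(head_dim)
            )
        # Assumed residual connection: keep the visual evidence, add the prior.
        outputs.append([x + f for x, f in zip(q, fused)])
    return outputs


# Toy inputs: two relation features and three knowledge embeddings of size 4.
relation_features = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
knowledge_embeddings = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.5, 0.5, 0.5, 0.5],
]
refined = cross_attention(relation_features, knowledge_embeddings, num_heads=2)
```

In this toy setup each relation feature is pulled toward the knowledge entries it most resembles. A trained model would normally add learned query, key, and value projections and would derive the knowledge embeddings from dataset statistics or another prior source; none of that is specified here.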



Index Terms

Video Scene Graph Generation with Spatial-Temporal Knowledge
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision tasks
        Scene understanding


Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia

October 2023

9913 pages

ISBN:9798400701085

DOI:10.1145/3581783

General Chairs:
Abdulmotaleb El Saddik (University of Ottawa, Canada & MBZUAI, UAE)
Tao Mei (HiDream.ai, China)
Rita Cucchiara (University of Modena and Reggio Emilia, Italy)

Program Chairs:
Marco Bertini (University of Florence, Italy)
Diana Patricia Tobon Vallejo (Universidad de Medellin, Colombia)
Pradeep K. Atrey (University at Albany, State University of New York, USA)
M. Shamim Hossain (King Saud University, KSA)

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper

Funding Sources

Guangdong Basic and Applied Basic Research Foundation
National Natural Science Foundation of China
National Key R&D Program of China

Conference

MM '23

Sponsor:

SIGMM

MM '23: The 31st ACM International Conference on Multimedia

October 29 - November 3, 2023

Ottawa ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
