research-article

Relation Understanding in Videos

Authors:

Sipeng Zheng,

Xiangyu Chen,

Shizhe Chen,

Qin Jin

MM '19: Proceedings of the 27th ACM International Conference on Multimedia

Pages 2662 - 2666

https://doi.org/10.1145/3343031.3356080

Published: 15 October 2019

Abstract

In this paper, we present our solution to the grand challenge task "Relation Understanding in Videos" at ACM Multimedia 2019. The task is to detect instances of target visual relations in a video, where each relation instance is represented by a relation triplet <subject, predicate, object> together with the trajectories of the subject and the object. At first glance the task resembles image relation detection with the input changed from images to videos. However, video relation detection requires a much more complex pipeline, which must not only detect objects in each frame but also track them over time. In this challenge, we follow the basic pipeline structure, which consists of three separate components: an object detector, an object tracker, and a relation predictor. Our analysis shows that the VidOR dataset suffers from data imbalance and missing labels, and we exploit two simple but effective methods to alleviate these problems. We also use trajectory Non-Maximum Suppression and a sliding window method to reduce the redundancy of trajectory proposals. Experimental results on the challenge task demonstrate that applying these approaches on top of the state-of-the-art relation prediction model improves video relation detection performance, with precision@1 and precision@5 reaching 0.3305 and 0.3507, respectively.
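The abstract does not spell out how trajectory Non-Maximum Suppression is computed, so the following is only a minimal sketch of one common formulation, not the authors' implementation. It assumes that each trajectory is a mapping from frame index to a box `(x1, y1, x2, y2)`, that overlap is measured by a volumetric IoU (`viou`), and that a trajectory is suppressed when its overlap with an already kept, higher-scoring trajectory exceeds a threshold. The function names and the threshold value of 0.5 are hypothetical choices for illustration.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout
Trajectory = Dict[int, Box]              # frame index -> box in that frame


def box_area(b: Box) -> float:
    """Area of a single box; degenerate boxes count as zero."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def box_intersection(a: Box, b: Box) -> float:
    """Intersection area of two boxes in the same frame."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)


def viou(t1: Trajectory, t2: Trajectory) -> float:
    """Volumetric IoU: intersection volume over union volume across frames."""
    shared_frames = t1.keys() & t2.keys()
    inter = sum(box_intersection(t1[f], t2[f]) for f in shared_frames)
    vol1 = sum(box_area(b) for b in t1.values())
    vol2 = sum(box_area(b) for b in t2.values())
    union = vol1 + vol2 - inter
    return inter / union if union > 0 else 0.0


def trajectory_nms(
    trajs: List[Trajectory], scores: List[float], threshold: float = 0.5
) -> List[int]:
    """Greedy NMS over trajectories; returns kept indices, best score first.

    The threshold of 0.5 is an illustrative assumption, not a value
    taken from the paper.
    """
    order = sorted(range(len(trajs)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a trajectory only if it does not heavily overlap a kept one.
        if all(viou(trajs[i], trajs[k]) <= threshold for k in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    a = {0: (0, 0, 10, 10), 1: (0, 0, 10, 10)}
    b = {0: (1, 0, 11, 10), 1: (1, 0, 11, 10)}      # near-duplicate of a
    c = {0: (50, 50, 60, 60), 1: (50, 50, 60, 60)}  # separate object
    print(trajectory_nms([a, b, c], [0.9, 0.8, 0.7]))
```

In this toy input the second trajectory is a shifted copy of the first and falls above the overlap threshold, so only the first and third survive. A sliding window scheme, as mentioned in the abstract, would presumably apply a step like this within each temporal window, but that detail is not given here.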

Index Terms

Relation Understanding in Videos
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision

Published In

MM '19: Proceedings of the 27th ACM International Conference on Multimedia

October 2019

2794 pages

ISBN:9781450368896

DOI:10.1145/3343031

General Chairs:
Laurent Amsaleg (CNRS-IRISA, France)
Benoit Huet (EURECOM, France)
Martha Larson (Radboud University and TU Delft, Netherlands)

Program Chairs:
Guillaume Gravier (CNRS-IRISA, France)
Hayley Hung (Delft University of Technology, Netherlands)
Chong-Wah Ngo (City University of Hong Kong, Hong Kong)
Wei Tsang Ooi (National University of Singapore, Singapore)

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

National Key Research and Development Plan
National Natural Science Foundation of China
Beijing Natural Science Foundation

Conference

MM '19

Sponsor:

SIGMM

MM '19: The 27th ACM International Conference on Multimedia

October 21 - 25, 2019

Nice, France

Acceptance Rates

MM '19 Paper Acceptance Rate 252 of 936 submissions, 27%;

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
