DOI: 10.1145/3394171.3416303

Deep Relationship Analysis in Video with Multimodal Feature Fusion

Published: 12 October 2020

Abstract

In this paper, we propose a novel multimodal feature fusion method based on scene segmentation to detect the relationships between entities in a long-duration video. Specifically, a long video is split into scenes, and the entities in each scene are tracked. Text, audio, and visual features in a scene are extracted to predict the relationships between the entities in that scene. These relationships form a knowledge graph of the video, which can be used to answer queries about the video. Experimental results show that our method performs well for deep video understanding on the HLVU dataset.
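The abstract describes fusing per-scene text, audio, and visual features before predicting relationships. A minimal sketch of that fusion step, assuming each modality yields one fixed-size embedding per scene (the function name, concatenation strategy, and dimensions here are illustrative, not the paper's actual design):

```python
import numpy as np

def fuse_scene_features(text_feat, audio_feat, visual_feat):
    """Concatenate the three modality embeddings into one fused vector.

    A simple early-fusion baseline; the paper's fusion may be more involved.
    """
    return np.concatenate([text_feat, audio_feat, visual_feat])

# Hypothetical per-scene embeddings (toy sizes for illustration).
text_feat = np.ones(4)         # e.g., a sentence-level text embedding
audio_feat = np.zeros(3)       # e.g., an audio-clip embedding
visual_feat = np.full(5, 2.0)  # e.g., a visual track embedding

fused = fuse_scene_features(text_feat, audio_feat, visual_feat)
print(fused.shape)
```

Concatenation keeps each modality's dimensions intact, so a downstream predictor can still weight modalities independently.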

Supplementary Material

MP4 File (3394171.3416303.mp4)
I am Beibei Zhang from Nanjing University. In this video, I will introduce our solution to the deep video understanding task: Deep Relationship Analysis in Video with Multimodal Feature Fusion.

We integrate visual, text, and audio features to represent the relationship between entities. We then compute the similarity between the integrated features and the features of predefined relationship descriptions, which are generated in the same way as the text features, to obtain the relationship. With the relationships detected by our multimodal feature fusion model, we construct a relationship graph of the video to help complete the tasks. The results of our method on each task are given in the presentation and the paper.
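The transcript's matching step — comparing the integrated feature against embeddings of predefined relationship descriptions — can be sketched with cosine similarity. The relationship names, embedding dimensions, and values below are hypothetical placeholders, not the paper's:

```python
import numpy as np

def predict_relationship(fused, relation_embs, relation_names):
    """Return the predefined relationship whose description embedding is
    most similar (by cosine similarity) to the fused entity-pair feature."""
    fused = fused / np.linalg.norm(fused)
    embs = relation_embs / np.linalg.norm(relation_embs, axis=1, keepdims=True)
    sims = embs @ fused  # cosine similarity with each relationship description
    return relation_names[int(np.argmax(sims))]

# Toy 3-dimensional "description" embeddings for two relationships.
relations = ["friend_of", "colleague_of"]
relation_embs = np.array([[1.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0]])

fused = np.array([0.9, 0.1, 0.0])  # fused multimodal feature for one pair
print(predict_relationship(fused, relation_embs, relations))
```

The predicted (subject, relationship, object) triples can then be accumulated into the relationship graph used to answer queries.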


Cited By

  • Shifted GCN-GAT and Cumulative-Transformer based Social Relation Recognition for Long Videos. In Proceedings of the 31st ACM International Conference on Multimedia (2023), 67-76. DOI: 10.1145/3581783.3612175
  • Leveraging spatial residual attention and temporal Markov networks for video action understanding. Neural Networks (2023). DOI: 10.1016/j.neunet.2023.10.047
  • MultiModal Language Modelling on Knowledge Graphs for Deep Video Understanding. In Proceedings of the 29th ACM International Conference on Multimedia (2021), 4868-4872. DOI: 10.1145/3474085.3479220
  • Joint Learning for Relationship and Interaction Analysis in Video with Multimodal Feature Fusion. In Proceedings of the 29th ACM International Conference on Multimedia (2021), 4848-4852. DOI: 10.1145/3474085.3479214
  • Learning to Cut by Watching Movies. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 6838-6848. DOI: 10.1109/ICCV48922.2021.00678

    Published In

    MM '20: Proceedings of the 28th ACM International Conference on Multimedia
    October 2020
    4889 pages
    ISBN:9781450379885
    DOI:10.1145/3394171

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. deep video understanding
    2. multimodal analysis
    3. relationship analysis

    Qualifiers

    • Short-paper

    Funding Sources

    • Science, Technology and Innovation Commission of Shenzhen Municipality
    • Natural Science Foundation of Jiangsu Province
    • Collaborative Innovation Center of Novel Software Technology and Industrialization

    Conference

    MM '20

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

