DOI: 10.1145/3474085.3479214

Joint Learning for Relationship and Interaction Analysis in Video with Multimodal Feature Fusion

Published: 17 October 2021

Abstract

To comprehend long-duration videos, the deep video understanding (DVU) task was proposed to recognize interactions at the scene level, recognize relationships at the movie level, and answer questions at these two levels. In this paper, we propose a solution to the DVU task that combines joint learning of interaction and relationship prediction with multimodal feature fusion. Our solution decomposes the DVU task into three jointly learned sub-tasks: scene sentiment classification, scene interaction recognition, and super-scene video relationship recognition, all of which fuse text, visual, and audio features and predict representations in a semantic space. Since sentiment, interaction, and relationship are related to each other, we train a unified framework with joint learning. We then answer the DVU video analysis questions based on the results of the three sub-tasks. We conduct experiments on the HLVU dataset to evaluate the effectiveness of our method.
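To make the three-sub-task setup concrete, below is a minimal sketch of joint multimodal learning in PyTorch: text, visual, and audio features are fused into a shared representation, and one classification head per sub-task is trained with a weighted joint loss. Every module name, feature dimension, the concatenation-based fusion, and the loss weights are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of joint learning with multimodal feature fusion.
# All dimensions, the concatenation-based fusion, and the loss weights
# are assumptions for illustration; the paper does not specify these.
import torch
import torch.nn as nn

class JointDVUModel(nn.Module):
    def __init__(self, text_dim=768, visual_dim=2048, audio_dim=128,
                 hidden_dim=512, n_sentiments=5, n_interactions=10,
                 n_relationships=15):
        super().__init__()
        # Shared encoder over the concatenated text/visual/audio features.
        self.fusion = nn.Sequential(
            nn.Linear(text_dim + visual_dim + audio_dim, hidden_dim),
            nn.ReLU(),
        )
        # One head per sub-task; joint training lets the related tasks
        # share the fused representation.
        self.sentiment_head = nn.Linear(hidden_dim, n_sentiments)
        self.interaction_head = nn.Linear(hidden_dim, n_interactions)
        self.relationship_head = nn.Linear(hidden_dim, n_relationships)

    def forward(self, text_feat, visual_feat, audio_feat):
        h = self.fusion(torch.cat([text_feat, visual_feat, audio_feat], dim=-1))
        return (self.sentiment_head(h),
                self.interaction_head(h),
                self.relationship_head(h))

def joint_loss(outputs, targets, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of per-task cross-entropy losses (weights assumed)."""
    ce = nn.CrossEntropyLoss()
    return sum(w * ce(o, t) for w, o, t in zip(weights, outputs, targets))
```

Sharing one fused encoder across the three heads is one simple way to exploit the stated correlation between sentiment, interaction, and relationship; the coupling actually used in the paper may differ.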

Supplementary Material

MP4 File (MM21-gch3339.mp4)
A joint learning method that predicts relationships and interactions simultaneously. Based on the resulting relationship and interaction knowledge graph, we can answer different types of deep video understanding queries, such as filling in part of the graph, answering multiple-choice questions, and finding the target video that matches a description. Because the development set consists of a small number of long videos, our method is also suited to low-shot learning.
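As a hypothetical illustration of how a predicted knowledge graph could serve the multiple-choice query type mentioned above, the sketch below scores the answer candidates against the predicted edge labels for a character pair. The graph data structure and the function are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical query answering over a predicted relationship graph:
# pick the candidate label with the highest predicted score for the
# queried character pair. Data structure and names are illustrative.
def answer_multiple_choice(graph_scores, pair, candidates):
    """graph_scores maps (person_a, person_b) -> {relationship: score}."""
    scores = graph_scores.get(pair, {})
    return max(candidates, key=lambda label: scores.get(label, 0.0))

# Example with made-up characters and scores:
graph = {("Anna", "Ben"): {"friend": 0.7, "sibling": 0.2, "boss": 0.1}}
print(answer_multiple_choice(graph, ("Anna", "Ben"), ["friend", "boss"]))
# -> "friend"
```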


Cited By

  • (2023) Shifted GCN-GAT and Cumulative-Transformer based Social Relation Recognition for Long Videos. Proceedings of the 31st ACM International Conference on Multimedia. DOI: 10.1145/3581783.3612175, pp. 67-76. Online publication date: 26-Oct-2023.
  • (2023) Multimodal early fusion operators for temporal video scene segmentation tasks. Multimedia Tools and Applications, 82:20, pp. 31539-31556. DOI: 10.1007/s11042-023-14953-6. Online publication date: 20-Mar-2023.
  • (2022) RETRACTED ARTICLE: ICDN: integrating consistency and difference networks by transformer for multimodal sentiment analysis. Applied Intelligence, 53:12, pp. 16332-16345. DOI: 10.1007/s10489-022-03343-4. Online publication date: 7-Mar-2022.
  • (2022) MT-TCCT: Multi-task Learning for Multimodal Emotion Recognition. Artificial Neural Networks and Machine Learning – ICANN 2022, pp. 429-442. DOI: 10.1007/978-3-031-15934-3_36. Online publication date: 6-Sep-2022.

Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN:9781450386517
DOI:10.1145/3474085

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. deep video understanding
  2. interaction analysis
  3. multimodal feature fusion
  4. relationship analysis

Qualifiers

  • Short-paper

Conference

MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

Acceptance Rates

Overall acceptance rate: 2,145 of 8,556 submissions (25%)
