DOI: 10.1145/3474085.3479214

Joint Learning for Relationship and Interaction Analysis in Video with Multimodal Feature Fusion

Published: 17 October 2021

Abstract

To comprehend long-duration videos, the deep video understanding (DVU) task was proposed to recognize interactions at the scene level, recognize relationships at the movie level, and answer questions at these two levels. In this paper, we propose a solution to the DVU task that combines joint learning of interaction and relationship prediction with multimodal feature fusion. Our solution decomposes the DVU task into three jointly learned sub-tasks: scene sentiment classification, scene interaction recognition, and super-scene video relationship recognition, all of which fuse text, visual, and audio features and predict representations in a semantic space. Since sentiment, interaction, and relationship are related to each other, we train a unified framework with joint learning. We then answer the DVU video analysis questions based on the results of the three sub-tasks. We conduct experiments on the HLVU dataset to evaluate the effectiveness of our method.
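To make the three-sub-task setup concrete, below is a minimal sketch of joint multimodal learning in PyTorch: text, visual, and audio features are fused into a shared representation, and one classification head per sub-task is trained with a weighted joint loss. Every module name, feature dimension, the concatenation-based fusion, and the loss weights are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of joint learning with multimodal feature fusion.
# All dimensions, the concatenation-based fusion, and the loss weights
# are assumptions for illustration; the paper does not specify these.
import torch
import torch.nn as nn

class JointDVUModel(nn.Module):
    def __init__(self, text_dim=768, visual_dim=2048, audio_dim=128,
                 hidden_dim=512, n_sentiments=5, n_interactions=10,
                 n_relationships=15):
        super().__init__()
        # Shared encoder over the concatenated text/visual/audio features.
        self.fusion = nn.Sequential(
            nn.Linear(text_dim + visual_dim + audio_dim, hidden_dim),
            nn.ReLU(),
        )
        # One head per sub-task; joint training lets the related tasks
        # share the fused representation.
        self.sentiment_head = nn.Linear(hidden_dim, n_sentiments)
        self.interaction_head = nn.Linear(hidden_dim, n_interactions)
        self.relationship_head = nn.Linear(hidden_dim, n_relationships)

    def forward(self, text_feat, visual_feat, audio_feat):
        h = self.fusion(torch.cat([text_feat, visual_feat, audio_feat], dim=-1))
        return (self.sentiment_head(h),
                self.interaction_head(h),
                self.relationship_head(h))

def joint_loss(outputs, targets, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of per-task cross-entropy losses (weights assumed)."""
    ce = nn.CrossEntropyLoss()
    return sum(w * ce(o, t) for w, o, t in zip(weights, outputs, targets))
```

Sharing one fused encoder across the three heads is one simple way to exploit the stated correlation between sentiment, interaction, and relationship; the coupling actually used in the paper may differ.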

Supplementary Material

MP4 File (MM21-gch3339.mp4)
A joint learning method that predicts relationships and interactions simultaneously. Based on the resulting relationship and interaction knowledge graph, we can answer different types of deep video understanding queries, such as filling in part of the graph, answering multiple-choice questions, and finding the target video that matches a description. Because the development set consists of a small number of long videos, our method is also suited to low-shot learning.
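As a hypothetical illustration of how a predicted knowledge graph could serve the multiple-choice query type mentioned above, the sketch below scores the answer candidates against the predicted edge labels for a character pair. The graph data structure and the function are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical query answering over a predicted relationship graph:
# pick the candidate label with the highest predicted score for the
# queried character pair. Data structure and names are illustrative.
def answer_multiple_choice(graph_scores, pair, candidates):
    """graph_scores maps (person_a, person_b) -> {relationship: score}."""
    scores = graph_scores.get(pair, {})
    return max(candidates, key=lambda label: scores.get(label, 0.0))

# Example with made-up characters and scores:
graph = {("Anna", "Ben"): {"friend": 0.7, "sibling": 0.2, "boss": 0.1}}
print(answer_multiple_choice(graph, ("Anna", "Ben"), ["friend", "boss"]))
# -> "friend"
```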


Cited By

  • (2023) Shifted GCN-GAT and Cumulative-Transformer based Social Relation Recognition for Long Videos. Proceedings of the 31st ACM International Conference on Multimedia. DOI: 10.1145/3581783.3612175, pp. 67-76. Online publication date: 26-Oct-2023.
  • (2023) Multimodal early fusion operators for temporal video scene segmentation tasks. Multimedia Tools and Applications, 82:20, pp. 31539-31556. DOI: 10.1007/s11042-023-14953-6. Online publication date: 20-Mar-2023.
  • (2022) RETRACTED ARTICLE: ICDN: integrating consistency and difference networks by transformer for multimodal sentiment analysis. Applied Intelligence, 53:12, pp. 16332-16345. DOI: 10.1007/s10489-022-03343-4. Online publication date: 7-Mar-2022.
  • (2022) MT-TCCT: Multi-task Learning for Multimodal Emotion Recognition. Artificial Neural Networks and Machine Learning – ICANN 2022, pp. 429-442. DOI: 10.1007/978-3-031-15934-3_36. Online publication date: 6-Sep-2022.

Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN:9781450386517
DOI:10.1145/3474085

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. deep video understanding
  2. interaction analysis
  3. multimodal feature fusion
  4. relationship analysis

Qualifiers

  • Short-paper

Conference

MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

Acceptance Rates

Overall acceptance rate: 2,145 of 8,556 submissions (25%)
