DOI: 10.1145/3394171.3416303

Deep Relationship Analysis in Video with Multimodal Feature Fusion

Published: 12 October 2020

Abstract

In this paper, we propose a novel multimodal feature fusion method based on scene segmentation to detect the relationships between entities in a long-duration video. Specifically, a long video is split into scenes, and the entities in each scene are tracked. Text, audio, and visual features in a scene are extracted to predict the relationships between the entities in that scene. These relationships form a knowledge graph of the video, which can be used to answer queries about the video. Experimental results show that our method performs well for deep video understanding on the HLVU dataset.
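The abstract describes fusing per-scene text, audio, and visual features before predicting relationships. A minimal sketch of that fusion step, assuming each modality yields one fixed-size embedding per scene (the function name, concatenation strategy, and dimensions here are illustrative, not the paper's actual design):

```python
import numpy as np

def fuse_scene_features(text_feat, audio_feat, visual_feat):
    """Concatenate the three modality embeddings into one fused vector.

    A simple early-fusion baseline; the paper's fusion may be more involved.
    """
    return np.concatenate([text_feat, audio_feat, visual_feat])

# Hypothetical per-scene embeddings (toy sizes for illustration).
text_feat = np.ones(4)         # e.g., a sentence-level text embedding
audio_feat = np.zeros(3)       # e.g., an audio-clip embedding
visual_feat = np.full(5, 2.0)  # e.g., a visual track embedding

fused = fuse_scene_features(text_feat, audio_feat, visual_feat)
print(fused.shape)
```

Concatenation keeps each modality's dimensions intact, so a downstream predictor can still weight modalities independently.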

Supplementary Material

MP4 File (3394171.3416303.mp4)
I am Beibei Zhang from Nanjing University. In this video, I will introduce our solution to the deep video understanding task: Deep Relationship Analysis in Video with Multimodal Feature Fusion.

We integrate visual, text, and audio features to represent the relationship between entities. We then compute the similarity between the integrated features and the features of predefined relationship descriptions, which are generated in the same way as the text features, to obtain the relationship. With the relationships detected by our multimodal feature fusion model, we construct a relationship graph of the video to help complete the tasks. The results of our method on each task are given in the presentation and the paper.
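The transcript's matching step — comparing the integrated feature against embeddings of predefined relationship descriptions — can be sketched with cosine similarity. The relationship names, embedding dimensions, and values below are hypothetical placeholders, not the paper's:

```python
import numpy as np

def predict_relationship(fused, relation_embs, relation_names):
    """Return the predefined relationship whose description embedding is
    most similar (by cosine similarity) to the fused entity-pair feature."""
    fused = fused / np.linalg.norm(fused)
    embs = relation_embs / np.linalg.norm(relation_embs, axis=1, keepdims=True)
    sims = embs @ fused  # cosine similarity with each relationship description
    return relation_names[int(np.argmax(sims))]

# Toy 3-dimensional "description" embeddings for two relationships.
relations = ["friend_of", "colleague_of"]
relation_embs = np.array([[1.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0]])

fused = np.array([0.9, 0.1, 0.0])  # fused multimodal feature for one pair
print(predict_relationship(fused, relation_embs, relations))
```

The predicted (subject, relationship, object) triples can then be accumulated into the relationship graph used to answer queries.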


Cited By

  • Shifted GCN-GAT and Cumulative-Transformer based Social Relation Recognition for Long Videos. In Proceedings of the 31st ACM International Conference on Multimedia (2023), 67-76. DOI: 10.1145/3581783.3612175
  • Leveraging spatial residual attention and temporal Markov networks for video action understanding. Neural Networks (2023). DOI: 10.1016/j.neunet.2023.10.047
  • MultiModal Language Modelling on Knowledge Graphs for Deep Video Understanding. In Proceedings of the 29th ACM International Conference on Multimedia (2021), 4868-4872. DOI: 10.1145/3474085.3479220
  • Joint Learning for Relationship and Interaction Analysis in Video with Multimodal Feature Fusion. In Proceedings of the 29th ACM International Conference on Multimedia (2021), 4848-4852. DOI: 10.1145/3474085.3479214
  • Learning to Cut by Watching Movies. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 6838-6848. DOI: 10.1109/ICCV48922.2021.00678

    Published In

    MM '20: Proceedings of the 28th ACM International Conference on Multimedia
    October 2020
    4889 pages
    ISBN:9781450379885
    DOI:10.1145/3394171

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. deep video understanding
    2. multimodal analysis
    3. relationship analysis

    Qualifiers

    • Short-paper

    Funding Sources

    • Science, Technology and Innovation Commission of Shenzhen Municipality
    • Natural Science Foundation of Jiangsu Province
    • Collaborative Innovation Center of Novel Software Technology and Industrialization

    Conference

    MM '20

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

