DOI: 10.1145/3469877.3493599

Hybrid Improvements in Multimodal Analysis for Deep Video Understanding

Published: 10 January 2022

Abstract

The Deep Video Understanding (DVU) Challenge is a task that focuses on comprehending long-duration videos involving many entities. Its main goal is to build a knowledge graph of the relationships and interactions between entities, and to use that graph to answer relevant questions. In this paper, we improve the joint learning method we previously proposed in several respects: few-shot learning, optical flow features, entity recognition, and video description matching. We verify the effectiveness of these improvements through experiments.
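The abstract names few-shot learning as one of the improvements. As an illustration only (this is not the authors' implementation; the embeddings and data below are hypothetical toys), a prototypical-network-style classifier represents each entity class by the mean of its support embeddings and assigns a query to the nearest prototype:

```python
# Illustrative sketch of prototypical-network-style few-shot
# classification: each class is summarized by the mean of its
# support embeddings, and queries go to the nearest prototype.
import numpy as np

def prototypes(support, labels):
    """Mean embedding per class. support: (n, d), labels: (n,)."""
    classes = np.unique(labels)
    protos = np.stack([support[labels == c].mean(axis=0) for c in classes])
    return classes, protos

def classify(query, support, labels):
    """Assign each query embedding to the class of its nearest prototype."""
    classes, protos = prototypes(support, labels)
    # Squared Euclidean distance from every query to every prototype.
    d = ((query[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return classes[d.argmin(axis=1)]

# Toy example: two well-separated entity classes in 2-D embedding space.
support = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
labels = np.array([0, 0, 1, 1])
query = np.array([[0.1, 0.0], [4.8, 5.2]])
print(classify(query, support, labels))  # → [0 1]
```

The appeal of this formulation for entity recognition in movies is that new characters can be added at test time from a handful of labeled frames, with no retraining of the embedding network.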


Cited By

  • (2023) Deep Video Understanding with Video-Language Model. In Proceedings of the 31st ACM International Conference on Multimedia, 9551–9555. DOI: 10.1145/3581783.3612863. Online publication date: 26 October 2023.
  • (2022) Multimodal Analysis for Deep Video Understanding with Video Language Transformer. In Proceedings of the 30th ACM International Conference on Multimedia, 7165–7169. DOI: 10.1145/3503161.3551600. Online publication date: 10 October 2022.

        Published In

MMAsia '21: Proceedings of the 3rd ACM International Conference on Multimedia in Asia
December 2021, 508 pages
ISBN: 9781450386074
DOI: 10.1145/3469877

Publisher

Association for Computing Machinery, New York, NY, United States

        Author Tags

        1. Deep video understanding
        2. few shot learning
        3. interaction analysis
        4. relationship analysis

        Qualifiers

        • Short-paper
        • Research
        • Refereed limited

Conference

MMAsia '21: ACM Multimedia Asia
December 1–3, 2021
Gold Coast, Australia

        Acceptance Rates

Overall acceptance rate: 59 of 204 submissions (29%)

