ABSTRACT
Pre-trained video-language models (VLMs) have shown superior performance on high-level video understanding tasks by analyzing multi-modal information, which aligns well with the requirements of the Deep Video Understanding Challenge (DVUC). In this paper, we explore the potential of pre-trained VLMs for multimodal question answering on long-form videos. We propose a solution called Dual Branches Video Modeling (DBVM), which combines knowledge graphs (KGs) and VLMs to leverage their complementary strengths and address their respective shortcomings. The KG branch recognizes and localizes entities, fuses multimodal features at different levels, and constructs KGs with entities as nodes and relationships as edges. The VLM branch applies a selection strategy to fit input movies into an acceptable length, and a cross-matching strategy to post-process the results and produce accurate scene descriptions. Experiments conducted on the DVUC dataset validate the effectiveness of our DBVM.
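Since only the abstract is available here, the following is a minimal illustrative sketch, not the authors' implementation, of the two ideas the abstract names: a KG branch that stores entities as nodes and relationships as edges, and a VLM-branch selection strategy that subsamples a long movie to fit a model's input budget. All names (`Entity`, `build_kg`, `select_clips`) are hypothetical, and uniform sampling is only one plausible choice of selection strategy.

```python
# Illustrative sketch only -- assumes entities/relations are already
# extracted upstream; uses networkx for the graph structure.
from dataclasses import dataclass

import networkx as nx


@dataclass(frozen=True)
class Entity:
    """A recognized movie entity (hypothetical structure)."""
    name: str
    kind: str  # e.g., "person", "location"


def build_kg(relations: list[tuple[Entity, str, Entity]]) -> nx.MultiDiGraph:
    """KG branch (sketch): entities become nodes, relationships become edges."""
    kg = nx.MultiDiGraph()
    for subj, rel, obj in relations:
        kg.add_node(subj.name, kind=subj.kind)
        kg.add_node(obj.name, kind=obj.kind)
        kg.add_edge(subj.name, obj.name, relation=rel)
    return kg


def select_clips(num_frames: int, budget: int) -> list[int]:
    """VLM branch (sketch): uniformly subsample frame indices so a
    long-form movie fits the model's maximum input length."""
    if num_frames <= budget:
        return list(range(num_frames))
    step = num_frames / budget
    return [int(i * step) for i in range(budget)]


if __name__ == "__main__":
    alice = Entity("Alice", "person")
    bob = Entity("Bob", "person")
    kg = build_kg([(alice, "friend_of", bob)])
    print(kg.edges(data=True))       # one "friend_of" edge from Alice to Bob
    print(select_clips(100_000, 8))  # 8 evenly spaced frame indices
```

In practice, a query-aware or scene-boundary-based selection would likely outperform uniform sampling for long-form movies, but the interface, mapping a full-length video to a bounded set of inputs, would be the same.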