ABSTRACT
Pre-trained video-language models (VLMs) have shown superior performance on high-level video understanding tasks by analyzing multi-modal information, which aligns well with the requirements of the Deep Video Understanding Challenge (DVUC). In this paper, we explore the potential of pre-trained VLMs for multimodal question answering on long-form videos. We propose a solution called Dual Branches Video Modeling (DBVM), which combines knowledge graphs (KGs) and VLMs to leverage their complementary strengths and address their respective shortcomings. The KG branch recognizes and localizes entities, fuses multimodal features at different levels, and constructs KGs with entities as nodes and relationships as edges. The VLM branch applies a selection strategy to fit input movies into an acceptable length, and a cross-matching strategy to post-process the results and produce accurate scene descriptions. Experiments conducted on the DVUC dataset validate the effectiveness of our DBVM.
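Since only the abstract is available here, the following is a minimal illustrative sketch, not the authors' implementation, of the two ideas the abstract names: a KG branch that stores entities as nodes and relationships as edges, and a VLM-branch selection strategy that subsamples a long movie to fit a model's input budget. All names (`Entity`, `build_kg`, `select_clips`) are hypothetical, and uniform sampling is only one plausible choice of selection strategy.

```python
# Illustrative sketch only -- assumes entities/relations are already
# extracted upstream; uses networkx for the graph structure.
from dataclasses import dataclass

import networkx as nx


@dataclass(frozen=True)
class Entity:
    """A recognized movie entity (hypothetical structure)."""
    name: str
    kind: str  # e.g., "person", "location"


def build_kg(relations: list[tuple[Entity, str, Entity]]) -> nx.MultiDiGraph:
    """KG branch (sketch): entities become nodes, relationships become edges."""
    kg = nx.MultiDiGraph()
    for subj, rel, obj in relations:
        kg.add_node(subj.name, kind=subj.kind)
        kg.add_node(obj.name, kind=obj.kind)
        kg.add_edge(subj.name, obj.name, relation=rel)
    return kg


def select_clips(num_frames: int, budget: int) -> list[int]:
    """VLM branch (sketch): uniformly subsample frame indices so a
    long-form movie fits the model's maximum input length."""
    if num_frames <= budget:
        return list(range(num_frames))
    step = num_frames / budget
    return [int(i * step) for i in range(budget)]


if __name__ == "__main__":
    alice = Entity("Alice", "person")
    bob = Entity("Bob", "person")
    kg = build_kg([(alice, "friend_of", bob)])
    print(kg.edges(data=True))       # one "friend_of" edge from Alice to Bob
    print(select_clips(100_000, 8))  # 8 evenly spaced frame indices
```

In practice, a query-aware or scene-boundary-based selection would likely outperform uniform sampling for long-form movies, but the interface, mapping a full-length video to a bounded set of inputs, would be the same.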