DOI: 10.1145/3581783.3612863
Research Article

Deep Video Understanding with Video-Language Model

Published: 27 October 2023

ABSTRACT

Pre-trained video-language models (VLMs) have shown superior performance on high-level video understanding tasks that analyze multi-modal information, which aligns well with the requirements of the Deep Video Understanding Challenge (DVUC). In this paper, we explore the potential of pre-trained VLMs for multimodal question answering on long-form videos. We propose a solution called Dual Branches Video Modeling (DBVM), which combines knowledge graphs (KGs) and VLMs, leveraging the strengths of each while addressing their shortcomings. The KG branch recognizes and localizes entities, fuses multimodal features at different levels, and constructs KGs with entities as nodes and relationships as edges. The VLM branch applies a selection strategy to reduce input movies to an acceptable length, and a cross-matching strategy to post-process the results into accurate scene descriptions. Experiments conducted on the DVUC dataset validate the effectiveness of our DBVM.
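The abstract describes the two branches only at a high level. As a rough illustration of the data structures involved, the Python sketch below builds a toy knowledge graph with entities as nodes and relationships as edges, and uniformly subsamples frame indices so a long movie fits a VLM's input budget. Every name here (build_scene_graph, select_frames, the networkx representation, and uniform sampling as the selection strategy) is an assumption made for illustration, not the authors' published implementation.

    # Illustrative sketch only; the paper does not ship code, so the
    # representation below is an assumption, not the DBVM implementation.
    import networkx as nx

    def build_scene_graph(entities, relationships):
        """Knowledge graph with entities as nodes and relationships as
        edges, mirroring the KG branch described in the abstract."""
        g = nx.MultiDiGraph()
        for name, attrs in entities.items():
            g.add_node(name, **attrs)
        for subj, rel, obj in relationships:
            g.add_edge(subj, obj, relation=rel)
        return g

    def select_frames(num_frames, max_frames):
        """Uniformly subsample frame indices so a long movie fits the
        VLM's input budget (one plausible 'selection strategy')."""
        if num_frames <= max_frames:
            return list(range(num_frames))
        step = num_frames / max_frames
        return [int(i * step) for i in range(max_frames)]

    # Toy example: two characters and one relationship from a movie scene.
    g = build_scene_graph(
        entities={"Alice": {"type": "person"}, "Bob": {"type": "person"}},
        relationships=[("Alice", "friend_of", "Bob")],
    )
    print(g.number_of_nodes(), g.number_of_edges())   # 2 1
    print(select_frames(num_frames=180000, max_frames=8))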

Supplemental Material

mmgc031-video.mp4 (MP4, 23.2 MB)

References

  1. Keith Curtis, George Awad, Shahzad Rajput, and Ian Soboroff. 2020. HLVU: A New Challenge to Test Deep Understanding of Movies the Way Humans do. In International Conference on Multimedia Retrieval. 355--361.Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Difei Gao, Luowei Zhou, Lei Ji, Linchao Zhu, Yi Yang, and Mike Zheng Shou. 2023. MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14773--14783.Google ScholarGoogle Scholar
  3. Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. 2022. Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning. Neurocomputing 508 (2022), 293--304.Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, and Haibin Ling. 2022. Expanding language-image pretrained models for general video recognition. In European Conference on Computer Vision. Springer, 1--18.Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Penggang Qin, Jiarui Yu, Yan Gao, Derong Xu, Yunkai Chen, Shiwei Wu, Tong Xu, Enhong Chen, and Yanbin Hao. 2022. Unified QA-aware knowledge graph generation based on multi-modal modeling. In The 30th ACM International Conference on Multimedia. 7185--7189.Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Raksha Ramesh, Vishal Anand, Zifan Chen, Yifei Dong, Yun Chen, and Ching-Yung Lin. 2022. Leveraging Text Representation and Face-head Tracking for Long- form Multimodal Semantic Relation Understanding. In The 30th ACM International Conference on Multimedia. 7215--7219.Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Anyi Rao, Linning Xu, Yu Xiong, Guodong Xu, Qingqiu Huang, Bolei Zhou, and Dahua Lin. 2020. A local-to-global approach to multi-modal movie scene segmentation. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10146--10155.Google ScholarGoogle ScholarCross RefCross Ref
  8. Siyang Sun, Xiong Xiong, and Yun Zheng. 2022. Two stage Multi-Modal Modeling for Video Interaction Analysis in Deep Video Understanding Challenge. In The 30th ACM International Conference on Multimedia. 7040--7044.Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Yuchong Sun, Hongwei Xue, Ruihua Song, Bei Liu, Huan Yang, and Jianlong Fu. 2022. Long-form video-language pre-training with multimodal temporal contrastive learning. Advances in neural information processing systems 35 (2022), 38032--38045.Google ScholarGoogle Scholar
  10. Chen-Wei Xie, Siyang Sun, Liming Zhao, Jianmin Wu, Dangwei Li, and Yun Zheng. 2022. Deep Video Understanding with a Unified Multi-Modal Retrieval Framework. In The 30th ACM International Conference on Multimedia. 7055--7059.Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Hu Xu, Gargi Ghosh, Po-Yao Huang, Prahal Arora, Masoumeh Aminzadeh, Christoph Feichtenhofer, Florian Metze, and Luke Zettlemoyer. 2021. Vlm: Task- agnostic video-language model pre-training for video understanding. arXiv preprint arXiv:2105.09996 (2021).Google ScholarGoogle Scholar
  12. Shoubin Yu, Jaemin Cho, Prateek Yadav, and Mohit Bansal. 2023. Self-Chained Image-Language Model for Video Localization and Question Answering. arXiv preprint arXiv:2305.06988 (2023).Google ScholarGoogle Scholar
  13. Beibei Zhang, Yaqun Fang, Tongwei Ren, and Gangshan Wu. 2022. Multimodal Analysis for Deep Video Understanding with Video Language Transformer. In The 30th ACM International Conference on Multimedia. 7165--7169.Google ScholarGoogle Scholar
  14. Beibei Zhang, Fan Yu, Yaqun Fang, Tongwei Ren, and Gangshan Wu. 2021. Hybrid improvements in multimodal analysis for deep video understanding. In ACM Multimedia Asia. 1--5.Google ScholarGoogle Scholar

Published in

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023, 9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783
Copyright © 2023 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 995 of 4,171 submissions, 24%
