Abstract:
Video captioning aims to automatically generate natural language descriptions from videos. While datasets like MSVD and MSR-VTT have driven research in recent years, they predominantly focus on visual features and describe simple actions, ignoring audio, text, and other modalities. This is a limitation, because multi-modal information plays an important role in generating accurate captions. In this study, we introduce a dataset, News-11k, which includes over 150,000 captions with multi-modal information from more than 11,000 selected news video clips. We annotate captions at three levels of granularity: coarse-grained, medium-grained, and fine-grained. Due to the characteristics of news videos, generating accurate captions on our dataset requires multi-modal understanding. We therefore propose a baseline model for multi-modal video captioning. To address the challenge of multi-modal information fusion, we devise a concatenating modal embedding strategy. Experiments indicate that multi-modal information significantly enhances the understanding of deeper semantics in videos. Data will be made available at https://github.com/David-Zeng-Zijian/News-11k.
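
The abstract names a "concatenating modal embedding strategy" but gives no implementation details. The following is only a minimal sketch of one plausible reading, in which per-modality features are projected to a shared width and concatenated along the sequence axis; all module names, feature dimensions, and the learned modality-type embeddings are assumptions, not the authors' published method.

# Illustrative sketch only; dimensions and module names are assumed.
import torch
import torch.nn as nn


class ConcatModalFusion(nn.Module):
    """Project per-modality features to a shared width, then concatenate
    them along the sequence dimension before feeding a caption decoder."""

    def __init__(self, visual_dim=2048, audio_dim=128, text_dim=768, d_model=512):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.text_proj = nn.Linear(text_dim, d_model)
        # Learned modality-type embeddings so the decoder can tell the
        # concatenated segments apart (an assumption, not from the paper).
        self.modal_type = nn.Embedding(3, d_model)

    def forward(self, visual_feats, audio_feats, text_feats):
        # visual_feats: (B, Tv, visual_dim), audio_feats: (B, Ta, audio_dim),
        # text_feats: (B, Tt, text_dim)
        v = self.visual_proj(visual_feats) + self.modal_type.weight[0]
        a = self.audio_proj(audio_feats) + self.modal_type.weight[1]
        t = self.text_proj(text_feats) + self.modal_type.weight[2]
        # Concatenate along the sequence axis to form one multi-modal sequence.
        return torch.cat([v, a, t], dim=1)  # (B, Tv + Ta + Tt, d_model)


if __name__ == "__main__":
    fusion = ConcatModalFusion()
    v = torch.randn(2, 20, 2048)   # frame-level visual features
    a = torch.randn(2, 50, 128)    # audio features
    t = torch.randn(2, 30, 768)    # OCR / transcript text features
    print(fusion(v, a, t).shape)   # torch.Size([2, 100, 512])

The fused sequence would then be consumed by a caption decoder; whether the actual baseline adds modality-type embeddings or simply concatenates raw projections is not specified in the abstract.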
Date of Conference: 15-19 July 2024
Date Added to IEEE Xplore: 30 September 2024