
Global-Shared Text Representation Based Multi-Stage Fusion Transformer Network for Multi-Modal Dense Video Captioning



Abstract:

Dense video captioning aims to detect all events in an untrimmed video and generate a corresponding textual caption for each event. Multi-modal information is essential for improving performance on this task, but existing methods rely mainly on a single visual modality or dual audio-visual modalities, completely ignoring the text modality (subtitles). Since text data has a representation similar to that of video caption words, it can help improve video captioning performance. In this article, we propose a novel framework, called the multi-stage fusion transformer network (MS-FTN), that realizes multi-modal dense video captioning by fusing text, audio, and visual features in stages. We present a multi-stage feature fusion encoder that first fuses the audio and visual modalities at a lower level and then fuses them with a global-shared text representation at a higher level to generate a set of multi-modal complementary context features. In addition, an anchor-free event proposal module is proposed to efficiently generate a set of event proposals without complex anchor calculations. Extensive experiments on subsets of the ActivityNet Captions dataset show that the proposed MS-FTN achieves superior performance with efficient computation. Moreover, ablation studies demonstrate that the global-shared text representation is more suitable for multi-modal dense video captioning.
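The abstract gives only a high-level view of the architecture. As a rough illustration, the following PyTorch sketch shows how a two-stage fusion encoder of this kind could be wired: audio and visual feature sequences are fused first, and the result is then fused with a global-shared text representation. All module names, dimensions, and the simple cross-attention blocks are illustrative assumptions and do not reproduce the authors' MS-FTN implementation.

# Illustrative two-stage multi-modal fusion encoder (audio-visual fusion first,
# then fusion with a shared text representation). Layer choices and dimensions
# are assumptions for illustration, not the authors' MS-FTN implementation.
import torch
import torch.nn as nn


class CrossAttentionBlock(nn.Module):
    """A query sequence attends to a key/value sequence, followed by a feed-forward layer."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(query, context, context)
        x = self.norm1(query + attended)
        return self.norm2(x + self.ffn(x))


class TwoStageFusionEncoder(nn.Module):
    """Stage 1: fuse audio into visual features (lower level).
    Stage 2: fuse a global-shared text representation into the audio-visual features (higher level)."""

    def __init__(self, d_model: int = 512):
        super().__init__()
        self.av_fusion = CrossAttentionBlock(d_model)    # lower-level audio-visual fusion
        self.text_fusion = CrossAttentionBlock(d_model)  # higher-level fusion with text

    def forward(self, visual, audio, text):
        av = self.av_fusion(visual, audio)   # (B, T_v, D) audio-aware visual context
        return self.text_fusion(av, text)    # (B, T_v, D) multi-modal context features


if __name__ == "__main__":
    B, D = 2, 512
    visual = torch.randn(B, 64, D)   # e.g. frame-level visual features
    audio = torch.randn(B, 32, D)    # e.g. audio segment features
    text = torch.randn(B, 20, D)     # e.g. global-shared subtitle/text representation
    out = TwoStageFusionEncoder(D)(visual, audio, text)
    print(out.shape)                 # torch.Size([2, 64, 512])

The resulting context features would then feed an event proposal module and a caption decoder; those components are not sketched here.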
Published in: IEEE Transactions on Multimedia (Volume: 26)
Page(s): 3164 - 3179
Date of Publication: 23 August 2023

