Journals & Magazines >IEEE Transactions on Multimedia >Volume: 26

Global-Shared Text Representation Based Multi-Stage Fusion Transformer Network for Multi-Modal Dense Video Captioning

Download PDF
Download References
Request Permissions
Save to
Alerts

Abstract:

Dense video captioning aims to detect all events of an uncropped video and generate corresponding textual captions for each event. Multi-modal information is essential to...Show More

Metadata

Abstract:

Dense video captioning aims to detect all events of an uncropped video and generate corresponding textual captions for each event. Multi-modal information is essential to improve the performance of this task, but the existing methods mainly rely on the single visual or dual audio-visual modal input, while completely ignoring the text modal input (subtitle). Since the text data has a similar data representation as video caption words, it is conducive to the performance improvement of video captioning. In this article, we propose a novel framework, called the multi-stage fusion transformer network (MS-FTN), to realize multi-modal dense video captioning by fusing the text, the audio, and the visual features in stages. We present a multi-stage feature fusion encoder that first fuses audio and visual modalities at a lower level and then fuses them with a global-shared text representation at a higher level to generate a set of multi-modal complementary context features. In addition, an anchor-free event proposal module is proposed to efficiently generate a set of event proposals without the complex anchor calculation. Extensive experiments on the subsets of the ActivityNet Captions dataset show that our proposed MS-FTN achieves superior performance and efficient computation. Moreover, the ablation studies demonstrate that the global-shared text representation is more suitable for multi-modal dense video captioning.

Published in: IEEE Transactions on Multimedia ( Volume: 26)

Page(s): 3164 - 3179

Date of Publication: 23 August 2023

ISSN Information:

DOI: 10.1109/TMM.2023.3307972

Funding Agency:

Contents

References is not available for this document.

Global-Shared Text Representation Based Multi-Stage Fusion Transformer Network for Multi-Modal Dense Video Captioning

Abstract:

Metadata

Abstract:

ISSN Information:

Funding Agency:

References

IEEE Account

Purchase Details

Profile Information

Need Help?

Global-Shared Text Representation Based Multi-Stage Fusion Transformer Network for Multi-Modal Dense Video Captioning

Alerts

Abstract:

Metadata

Abstract:

ISSN Information:

Funding Agency:

References

IEEE Account

Purchase Details

Profile Information

Need Help?