CLIP4VideoCap: Rethinking Clip for Video Captioning with Multiscale Temporal Fusion and Commonsense Knowledge

Abstract:

In this paper, we propose CLIP4VideoCap, a video captioning framework that builds on the large-scale pre-trained CLIP image and text encoders and adds multi-scale temporal reasoning and commonsense knowledge. The CLIP image encoder operates on successive video frames, and a knowledge distillation-based learning scheme exploits the CLIP text encoder to generate rich textual knowledge from the image features. To improve temporal reasoning over the video, we propose a multi-scale temporal fusion scheme that accumulates temporal features from different temporal windows. We also integrate several commonsense aspects into caption generation by extracting commonsense features from the video in an intermediate phase, which greatly enhances caption quality. Combining these strategies, we achieve state-of-the-art performance on the benchmark MSR-VTT dataset, confirming that our framework significantly outperforms existing approaches.
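
The abstract does not specify how features from different temporal windows are accumulated, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes per-frame feature vectors are already available (for example, from an image encoder), mean-pools them over sliding windows of several sizes, max-pools across the windows at each size, and concatenates the per-scale summaries. The names `window_means` and `multiscale_fuse`, the window sizes, and the pooling choices are all hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def window_means(frames: Sequence[Vector], size: int) -> List[Vector]:
    """Mean-pool frame features over each sliding window of `size` frames."""
    if size < 1 or size > len(frames):
        raise ValueError("window size must be between 1 and the number of frames")
    dim = len(frames[0])
    means = []
    for start in range(len(frames) - size + 1):
        window = frames[start:start + size]
        means.append([sum(f[d] for f in window) / size for d in range(dim)])
    return means


def multiscale_fuse(frames: Sequence[Vector], sizes: Sequence[int] = (1, 2, 4)) -> Vector:
    """Concatenate one summary vector per window size (assumed fusion rule).

    Each summary is the element-wise maximum over that scale's window means.
    """
    fused: Vector = []
    for size in sizes:
        means = window_means(frames, size)
        dim = len(means[0])
        fused.extend(max(m[d] for m in means) for d in range(dim))
    return fused


# Toy input: 4 frames with 2-dimensional features (made-up values).
frames = [[0.0, 1.0], [2.0, 1.0], [4.0, 3.0], [6.0, 3.0]]
video_feature = multiscale_fuse(frames)
print(video_feature)
```

With three window sizes and 2-dimensional frame features, the fused vector has six entries, two per scale. A learned fusion (for example, attention over scales) could replace the concatenation; the abstract leaves that choice open.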

Published in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Date of Conference: 04-10 June 2023

Date Added to IEEE Xplore: 05 May 2023

DOI: 10.1109/ICASSP49357.2023.10097128

Conference Location: Rhodes Island, Greece
