Stay in Grid: Improving Video Captioning via Fully Grid-Level Representation | IEEE Journals & Magazine | IEEE Xplore

Abstract:

Video captioning is the challenging task of automatically generating natural and meaningful textual descriptions for given videos. State-of-the-art methods aggregate spatial information early, in the video encoder, which has two drawbacks: 1) early aggregation in the encoder can discard considerable spatial detail, which may in turn lead to incorrect word choices in the subsequent text decoder; 2) the spatial attention learned in the video encoder may not be compelling enough without text guidance. To solve these problems, we propose SGCAP, a Stay-in-Grid video CAPtioning method that makes full use of grid-level spatial features and consists of a Bilinear Sequential Attention Encoder (BSAE) and a Cross-modal Sequential Attention Decoder (CSAD). The former explores and retains fully grid-level discriminative representations in the video encoder, while the latter performs late spatial aggregation in the decoder, attending to the most relevant regions under the supervision of the input words. Experimental results on three public datasets demonstrate the effectiveness of our method, which outperforms multiple state-of-the-art video captioning models. Source code and pre-trained models will be made available to the public.
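The contrast between early and late spatial aggregation can be illustrated with a toy sketch. The code below is not the BSAE or CSAD architecture: it assumes plain mean pooling as a stand-in for text-agnostic early aggregation and plain dot-product attention as a stand-in for word-guided late aggregation. The names `early_aggregate`, `late_aggregate`, `grid`, and `query` are hypothetical and chosen only for this illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def early_aggregate(grid_feats):
    """Text-agnostic pooling (assumed stand-in): average all grid cells
    into one vector before any word is seen, so per-cell detail is lost."""
    n = len(grid_feats)
    dim = len(grid_feats[0])
    return [sum(cell[d] for cell in grid_feats) / n for d in range(dim)]


def late_aggregate(grid_feats, word_query):
    """Word-guided pooling (assumed stand-in): keep every grid cell and
    weight the cells by dot-product similarity to the current word query."""
    scores = [sum(c * q for c, q in zip(cell, word_query)) for cell in grid_feats]
    weights = softmax(scores)
    dim = len(grid_feats[0])
    pooled = [
        sum(w * cell[d] for w, cell in zip(weights, grid_feats))
        for d in range(dim)
    ]
    return pooled, weights


# Toy 2x2 grid flattened into 4 cells with 3-dimensional features.
grid = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
]
# Hypothetical word query that matches the second cell.
query = [0.0, 5.0, 0.0]

early = early_aggregate(grid)
late, weights = late_aggregate(grid, query)
```

In this toy setting the mean-pooled vector treats all cells equally, whereas the query-weighted vector concentrates on the cell that matches the word, which is the intuition behind deferring spatial aggregation to the decoder.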
Page(s): 3319 - 3332
Date of Publication: 27 December 2022
