
Neurocomputing

Volume 445, 20 July 2021, Pages 72-80

Accelerated masked transformer for dense video captioning

https://doi.org/10.1016/j.neucom.2021.03.026

Highlights

  • We propose an Accelerated Masked Transformer (AMT) model for fast dense video captioning.

  • Compared with its MT counterpart, AMT is significantly faster while maintaining comparable performance.

  • The results on two real-world datasets demonstrate the effectiveness of AMT.

Abstract

Dense video captioning aims to generate dense descriptions for all possible events in an untrimmed video. The task is challenging in that it requires accurately localizing events in the video and simultaneously describing each event with a sentence. Current approaches usually decompose this task into two independent stages, the proposal localization stage and the caption generation stage, resulting in a suboptimal solution. The Masked Transformer (MT) model [30] has been proposed to integrate the two stages and optimize them in an end-to-end manner. Despite the superior performance that MT has achieved, its runtime efficiency is unsatisfactory, which severely limits its applicability in real-world scenarios. In this paper, we devise an improved Accelerated Masked Transformer (AMT) model that enjoys the dual benefit of effectiveness and efficiency. Taking MT as our reference model, we introduce accelerating strategies to the two stages: 1) in the proposal localization stage, we introduce a lightweight anchor-free proposal predictor together with a local attention mechanism; and 2) in the caption generation stage, we introduce a single-shot feature masking strategy along with an average attention mechanism. Extensive experiments on two benchmark datasets, ActivityNet-Caption and YouCookII, demonstrate that AMT achieves competitive performance on both datasets with a significant speed improvement. On the ActivityNet-Caption dataset, AMT runs up to 2× faster than the reference MT model while delivering comparable performance.

Introduction

The overwhelming amount of video on the Internet has brought about an urgent need for automatically extracting and understanding its essential information. Thanks to recent advances in deep multimodal learning, we are able to bridge vision with language. These achievements facilitate multimodal learning tasks such as cross-modal matching [24], [10], visual question answering [5], [25], [28], visual grounding [26], [12], [23], and visual captioning [17], [19], [9].

Video captioning requires a simultaneous understanding of both the spatial and temporal aspects of a video. Similar to image captioning [18], [22], the encoder-decoder framework with various attention mechanisms [9], [10], [20], [11], [3] is the dominant approach and has achieved top performance on multiple benchmarks. Yao et al. take into account both the local and global temporal structure of videos in caption generation [20]. Pan et al. introduce a hierarchical recurrent encoder with attention mechanisms to better characterize the temporal information of videos. Chen et al. introduce a temporal deformable convolutional encoder-decoder network to capture long-term relationships effectively [3]. Despite the success of the above approaches, they are limited to generating a single sentence and cannot be directly applied to generating a paragraph for a long video. Yu et al. use a hierarchical recurrent model to generate a paragraph for a long video [21], while Xiong et al. introduce an event selection module that determines which proposals should be used for caption generation and then generate a coherent paragraph based on the selected event proposals [16].

In contrast to the video captioning task which generates a sentence or paragraph for a video, dense video captioning aims to localize and describe potential events from a long-term video at the same time. Current approaches usually decompose this task into two independent stages—the proposal localization stage and the caption generation stage. Based on this framework, Krishna et al. propose a multi-scale proposal network to generate event proposals and introduce a captioning network with an attention mechanism to exploit the visual context of the proposal [6]. Wang et al. employ a bi-directional RNN and a context gating module to improve the quality of generated proposals and captions, respectively [15]. However, the above approaches separate the proposal localization stage and the caption generation stage, resulting in a suboptimal solution.

To address this problem, an end-to-end Masked Transformer (MT) model has been proposed [30], inspired by the well-known Transformer model in machine translation [14]. Specifically, the MT model adopts a deep encoder-decoder architecture that consists of a proposal encoder to predict event proposals and a caption decoder to generate a caption for each proposal. Unlike most existing approaches that use recurrent neural networks (RNNs) to model temporal information, MT uses stacked self-attention blocks instead, which better characterize long-range dependencies. Despite the superior performance that MT has achieved, its runtime efficiency in the testing stage is unsatisfactory, which severely limits its applicability in real-world scenarios.

In this paper, we devise an improved Accelerated Masked Transformer (AMT) model that enjoys the dual benefit of effectiveness and efficiency. Taking MT as our reference model, we introduce accelerating strategies to the two stages: 1) in the proposal localization stage, we introduce a lightweight anchor-free proposal predictor combined with a local attention mechanism; and 2) in the caption generation stage, we introduce a single-shot feature masking strategy along with an average attention mechanism. Compared with the 3-layer Masked Transformer model [30] in the same experimental environment, our AMT model is about 2× faster in the testing stage while delivering slightly better performance (see Fig. 1).
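
For concreteness, the sketch below shows how a local (banded) attention mask can restrict scaled dot-product attention over the temporal feature sequence, which is the general idea behind the local attention mechanism mentioned above; the window size and all tensor names are illustrative assumptions rather than the exact configuration used in AMT.

```python
import torch
import torch.nn.functional as F

def local_attention(q, k, v, window=5):
    """Scaled dot-product attention restricted to a temporal window.

    q, k, v: (batch, T, d) temporal feature sequences; `window` is the
    half-width of the band each position may attend to (an illustrative choice).
    """
    d = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5   # (batch, T, T)

    # Band mask: position i only attends to positions j with |i - j| <= window.
    T = q.size(1)
    idx = torch.arange(T, device=q.device)
    band = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs() <= window  # (T, T) bool
    scores = scores.masked_fill(~band, float('-inf'))

    attn = F.softmax(scores, dim=-1)
    return torch.matmul(attn, v)

# Usage: 100 temporal positions with 512-d features per clip.
q = k = v = torch.randn(2, 100, 512)
out = local_attention(q, k, v, window=5)   # (2, 100, 512)
```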

The rest of the paper is organized as follows: In Section 2, we revisit the MT model for dense video captioning, and in Section 3 we propose the Accelerated Masked Transformer (AMT) model. In Section 4, we present extensive experimental results on the benchmark ActivityNet-Caption and YouCookII datasets to evaluate the proposed approach. Finally, we conclude this work in Section 5.

Section snippets

Revisiting masked transformer

In this section, we revisit some background of the Masked Transformer [30] model for dense video captioning, which is the reference counterpart for our model. Before that, we first introduce the multi-head attention and feed-forward networks, the basic building blocks of Masked Transformer. After that, we describe the components of Masked Transformer in detail and then analyze its efficiency bottleneck.
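
As a reminder of these building blocks, the following is a minimal sketch of standard multi-head attention and a position-wise feed-forward network in the style of the Transformer [14]; the dimensions and layer sizes are illustrative defaults and do not necessarily match the configuration of the Masked Transformer [30].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Standard multi-head scaled dot-product attention."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_head = d_model // n_heads
        self.n_heads = n_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v):
        B, T, _ = q.shape
        # Project and split into heads: (B, n_heads, T, d_head).
        def split(x):
            return x.view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        ctx = F.softmax(scores, dim=-1) @ v                     # (B, h, T, d_head)
        ctx = ctx.transpose(1, 2).contiguous().view(B, T, -1)   # concatenate heads
        return self.w_o(ctx)

class PositionwiseFFN(nn.Module):
    """Two-layer feed-forward network applied independently to every position."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))
```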

Accelerated masked transformer

To address the efficiency bottlenecks of the MT model above, we introduce our improvements on the MT model and propose an Accelerated Masked Transformer (AMT) model, shown in Fig. 2. Compared with the reference MT model, our improvements are mainly reflected in the following four aspects: the local attention mechanism in the MHA layer of each encoder block, the anchor-free proposal predictor for proposal generation, the single-shot feature masking, and the average attention module (AAM) in the caption decoder.
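
To illustrate the anchor-free idea, the sketch below follows an FCOS-style formulation (FCOS is cited in the references): at each temporal position, a lightweight head regresses the distances to the event start and end and predicts a confidence score, avoiding a pre-defined set of anchors. The layer sizes and variable names are assumptions for illustration and are not claimed to be the exact predictor used in AMT.

```python
import torch
import torch.nn as nn

class AnchorFreeProposalHead(nn.Module):
    """Anchor-free temporal proposal head (illustrative sketch).

    For each of the T temporal positions it predicts:
      - non-negative offsets (d_start, d_end): distances to the event boundaries,
      - a confidence score for the position lying inside an event.
    """
    def __init__(self, d_model=512):
        super().__init__()
        self.reg = nn.Linear(d_model, 2)    # boundary offsets
        self.cls = nn.Linear(d_model, 1)    # event confidence

    def forward(self, feats):
        # feats: (B, T, d_model) encoder outputs.
        B, T, _ = feats.shape
        offsets = torch.relu(self.reg(feats))              # (B, T, 2), non-negative
        conf = torch.sigmoid(self.cls(feats)).squeeze(-1)  # (B, T)
        pos = torch.arange(T, device=feats.device).float()
        start = pos - offsets[..., 0]                      # proposal start position
        end = pos + offsets[..., 1]                        # proposal end position
        return torch.stack([start, end], dim=-1), conf     # (B, T, 2), (B, T)
```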

Experimental results

In this section, we conduct experiments to evaluate the AMT models on the ActivityNet-Caption [6] and YouCookII [29] datasets.

Conclusions

In this paper, we propose an Accelerated Masked Transformer (AMT) model for fast dense video captioning. Taking the encoder-decoder-based Masked Transformer (MT) [30] as the reference model, we introduce accelerating strategies to the encoder and decoder of the MT model to resolve their efficiency bottlenecks. Extensive experiments on two benchmark datasets, ActivityNet-Caption and YouCookII, demonstrate that AMT achieves competitive performance on both datasets with a significant speed improvement.

CRediT authorship contribution statement

Zhou Yu: Conceptualization, Methodology, Writing - original draft, Writing - review & editing, Investigation. Nanjia Han: Data curation, Investigation, Methodology, Software.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was supported in part by the National Key R&D Program of China (Grant No. 2020YFB1406701), and in part by the National Natural Science Foundation of China under Grant 62072147 and Grant 61836002.

References (30)

  • J.L. Ba, J.R. Kiros, G.E. Hinton, Layer normalization, 2016. arXiv preprint...
  • F. Caba Heilbron et al., ActivityNet: a large-scale video benchmark for human activity understanding.
  • J. Chen et al., Temporal deformable convolutional encoder-decoder networks for video captioning.
  • K. He et al., Deep residual learning for image recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • J.H. Kim et al., Bilinear attention networks.
  • R. Krishna et al., Dense-captioning events in videos.
  • H. Law et al., CornerNet: detecting objects as paired keypoints.
  • J. Mun et al., Streamlined dense video captioning.
  • P. Pan et al., Hierarchical recurrent neural encoder for video representation with application to captioning.
  • Y. Pan et al., Jointly modeling embedding and translation to bridge video and language.
  • Y. Pan et al., Video captioning with transferred semantic attributes.
  • A. Rohrbach et al., Grounding of textual phrases in images by reconstruction, European Conference on Computer Vision (ECCV), 2016.
  • Z. Tian et al., FCOS: fully convolutional one-stage object detection.
  • A. Vaswani et al., Attention is all you need, Adv. Neural Inf. Process. Syst. (NIPS), 2017.
  • J. Wang et al., Bidirectional attentive fusion with context gating for dense video captioning.

Zhou Yu received the B.Eng. and Ph.D. degrees from Zhejiang University, Zhejiang, China, in 2010 and 2015, respectively. He is currently an Associate Professor with the School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China. His research interests include multimodal analysis, computer vision, machine learning and deep learning. He has served as a reviewer or program committee member of prestigious journals and top conferences including Neurocomputing, IEEE Trans. on CSVT, IEEE Trans. on Multimedia, IEEE Trans. on Image Processing, IJCAI and AAAI, etc.

Nanjia Han received the B.Eng. degree from the School of Management, China Jiliang University, Hangzhou, China, in 2018. He is currently pursuing the M.Eng. degree with the School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China. His current research interests include multimodal analysis, computer vision and machine learning.
