Pattern Recognition, Volume 98, February 2020, 107037

Spatio-temporal deformable 3D ConvNets with attention for action recognition

https://doi.org/10.1016/j.patcog.2019.107037

Highlights

  • We are the first to propose spatio-temporal deformable 3D convolutions with an attention mechanism (STDA for short).

  • The proposed module serves as a generic module for many 3D CNNs; in practice, it only needs to be appended at the later convolution layers without incurring much additional computational cost.

  • Our attention mechanism can exploit both long-range temporal dependencies across multiple frames and long-distance spatial dependencies inside each frame, and thus helps extract discriminative global information at both the inter-frame and intra-frame levels.

  • Experiments validate the superior performance and efficiency of the proposed approach.

Abstract

The irregularity of human actions poses great challenges in video action recognition. Recently, 3D ConvNet methods have shown promising performance in modelling the motion and appearance information. However, the fixed geometric structure of 3D convolution filters largely limits the learning capacity for video action recognition. To address this problem, this paper proposes a spatio-temporal deformable ConvNet module with an attention mechanism, which takes into consideration the mutual correlations in both temporal and spatial domains, to effectively capture the long-range and long-distance dependencies in the video actions. Our attention-based deformable module, as a generic module for 3D ConvNets, can adaptively learn more accurate spatio-temporal offsets to model the action irregularity. Experiments on two popular datasets (UCF-101 and HMDB-51) demonstrate that our module significantly outperforms the state-of-the-art methods.

Introduction

Video action recognition [1], [2], [3], [4], [5] has been widely investigated in the computer vision community. Modelling the temporal and spatial variations is one of its most essential yet challenging issues. This is mainly because video actions usually contain complex spatio-temporal correlations, including both long-range dependencies among the sequential frames and long-distance dependencies within the spatial field of each frame. To address the problem, traditional solutions such as the improved Dense Trajectories (iDT) [6] rely on hand-crafted features (e.g., the optical flows) to obtain motion information. Recent deep end-to-end solutions like two-stream ConvNets [7] learn the discriminative features from the optical flow and the appearance via two separate 2D ConvNets. Instead of optical flow features, many studies adopt recurrent neural networks (RNNs), such as LSTM [8] and ConvLSTM [9], to model the video as an ordered sequence of frames.

Despite the good performance of optical-flow-based methods, extracting the optical flow is usually computationally intensive, while RNN-based methods normally capture only a coarse temporal structure among frames. To avoid the expensive computation while capturing the spatio-temporal information, [10] first introduced 3D ConvNets to model the motion and appearance information simultaneously via 3D convolution filters, leading to a number of 3D ConvNet methods, such as C3D [11], P3D ResNet [12], two-stream I3D [13], and the mixed 3D/2D convolutional tube [14].

Unfortunately, the fixed geometric structures of 3D convolution filters in both the receptive and sequential fields largely limit the learning capacity of 3D ConvNets for video action recognition. Intuitively, different parts of the body may move in various directions along the temporal and spatial dimensions for the same human action, which means the regions that 3D convolutions need to cover tend to be irregular in practice. For instance, in the spatial dimension, the fixed receptive field prevents the high-level convolution layers from encoding the semantics over the spatial locations, especially for non-rigid objects [15].

To deal with the irregularity, several techniques have been proposed for 2D ConvNets, such as spatial transformer networks (STN) [16], active convolution [17] and deformable convolution [18]. Of all these techniques, the deformable convolution is the most successful solution as it can implicitly model large, unknown transformations. However, it can hardly obtain the long-range dependencies beyond the local receptive fields, even with the deformable geometric convolution structures. Besides, when the 2D spatial dimension is coupled with the temporal dimension, it becomes more difficult for 3D ConvNets to eliminate the negative effects of the regular 3D cubic geometric structure. Adaptively capturing the temporal and spatial variations for action recognition thus remains an open problem.

To capture the complex action variations, we propose a Spatio-Temporal Deformable 3D convolution module with an Attention mechanism (STDA for short). Our attention mechanism can exploit both long-range temporal dependencies across multiple frames and long-distance spatial dependencies within each frame, thus enabling the extraction of discriminative global information at both the inter-frame and intra-frame levels. Building on this, different from the traditional convolution restricted to a local regular receptive field [19], our spatio-temporal deformable 3D ConvNet can further capture the temporal and spatial irregularity by learning varying convolution filter offsets guided by the attention information.
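To make the idea concrete, the following is a minimal PyTorch sketch (not the authors' implementation) of an attention-guided deformable 3D block: a non-local style self-attention block supplies global spatio-temporal context, a small convolution predicts per-location offsets from the attended features, and the feature map is warped by trilinear sampling before a standard 3D convolution. For brevity it predicts a single (x, y, t) offset per output location rather than one per kernel tap, so it only approximates the full deformable formulation; all layer sizes are illustrative.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AttentionGuidedDeformable3d(nn.Module):
        """Simplified sketch of an attention-guided deformable 3D convolution."""
        def __init__(self, channels):
            super().__init__()
            inter = max(channels // 2, 1)
            # 1x1x1 embeddings for the non-local (self-attention) block.
            self.theta = nn.Conv3d(channels, inter, kernel_size=1)
            self.phi = nn.Conv3d(channels, inter, kernel_size=1)
            self.g = nn.Conv3d(channels, inter, kernel_size=1)
            self.out = nn.Conv3d(inter, channels, kernel_size=1)
            # Predicts one normalized (x, y, t) offset per location; zero-initialized
            # so the block starts out as an ordinary (undeformed) 3D convolution.
            self.offset = nn.Conv3d(channels, 3, kernel_size=3, padding=1)
            nn.init.zeros_(self.offset.weight)
            nn.init.zeros_(self.offset.bias)
            self.conv = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

        def forward(self, x):                                  # x: (N, C, L, H, W)
            n, c, l, h, w = x.shape
            # Attention: every spatio-temporal position attends to every other one.
            q = self.theta(x).flatten(2).transpose(1, 2)       # (N, LHW, C')
            k = self.phi(x).flatten(2)                         # (N, C', LHW)
            v = self.g(x).flatten(2).transpose(1, 2)           # (N, LHW, C')
            attn = torch.softmax(q @ k / q.shape[-1] ** 0.5, dim=-1)
            ctx = (attn @ v).transpose(1, 2).reshape(n, -1, l, h, w)
            feat = x + self.out(ctx)                           # add global context
            # Deformation: warp the features by the predicted offsets, then convolve.
            off = self.offset(feat).permute(0, 2, 3, 4, 1)     # (N, L, H, W, 3)
            t = torch.linspace(-1, 1, l, device=x.device)
            ys = torch.linspace(-1, 1, h, device=x.device)
            xs = torch.linspace(-1, 1, w, device=x.device)
            gt, gy, gx = torch.meshgrid(t, ys, xs, indexing="ij")
            base = torch.stack((gx, gy, gt), dim=-1)           # grid_sample expects (x, y, t)
            grid = base.unsqueeze(0).expand(n, -1, -1, -1, -1) + off
            warped = F.grid_sample(feat, grid, align_corners=True)
            return self.conv(warped)                           # same shape as the input

Appending such a block after a late 3D convolution layer, as the highlights suggest, keeps the extra cost moderate because the attention and offset branches operate on relatively small feature maps.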

To the best of our knowledge, we are the first to propose an efficient spatio-temporal deformable module equipped with an attention submodule in 3D ConvNets for action recognition. Without much extra computational effort, the proposed module can easily replace the standard 3D convolution filter in popular 3D ConvNets. Extensive evaluations on two diverse benchmarks (UCF-101 and HMDB-51) show that, compared with the state-of-the-art methods, our STDA module can significantly boost the overall recognition performance.

The remainder of the paper is structured as follows. Section 2 presents the related work, including video action recognition, various types of ConvNets and attention models. Section 3 elaborates the proposed module and the key techniques. In Section 4 we extensively evaluate our module over different datasets and 3D network architectures. Section 5 concludes the paper.

Section snippets

Video action recognition

Video analysis is an extensively studied topic in the literature [20], [21], [22], [23], [24], [25], especially for the action recognition task. Typical traditional methods rely on hand-crafted features such as 3D-SIFT [26] and motion boundary histograms (MBH) [27]; for example, [28] proposed space-time interest points by extending the notion of spatial interest points into the spatio-temporal domain, and showed that the resulting features provide a compact representation of video data.

The methodology

The notations used in this paper are as follows. For the video clips, suppose we have a sequential input feature map $X \in \mathbb{R}^{c \times l \times h \times w}$, where c denotes the number of channels, l is the number of frames in each input sliding temporal window, and h and w denote the height and the width of the feature map, respectively. Similarly, we define $Y \in \mathbb{R}^{c \times l \times h \times w}$ as the output of X after passing through our STDA module. In practice, the 3D convolution filter shares the identical (e.g., 3) size in the three dimensions.
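As a quick sanity check of this notation, the short snippet below (with illustrative shape values, not taken from the paper) confirms that a 3 × 3 × 3 convolution with stride 1 and padding 1 maps X to an output Y of the same c × l × h × w size.

    import torch
    import torch.nn as nn

    c, l, h, w = 64, 16, 28, 28            # illustrative values only
    x = torch.randn(1, c, l, h, w)         # a batch dimension is added for PyTorch
    conv3d = nn.Conv3d(c, c, kernel_size=3, stride=1, padding=1)
    y = conv3d(x)
    assert y.shape == x.shape              # Y has the same c x l x h x w size as X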

Evaluation protocols

To evaluate our spatio-temporal deformable 3D ConvNets with attention (STDA), we follow the prior studies [30], [54], [55], [56] and choose two popular yet challenging video datasets: UCF-101 and HMDB-51. Fig. 3 shows some typical action examples from the UCF-101 dataset (top row) and the HMDB-51 dataset (bottom row). For both datasets, we use the standard training/testing splits and report the average accuracy.
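A minimal sketch of this reporting protocol is given below; evaluate_split is a hypothetical placeholder for the actual train/test pipeline, and the three standard splits of each dataset are assumed.

    from statistics import mean

    def evaluate_split(dataset: str, split: int) -> float:
        """Hypothetical placeholder: train on the split's training list and
        return the top-1 test accuracy for that split."""
        raise NotImplementedError

    def average_accuracy(dataset: str, n_splits: int = 3) -> float:
        # Report the mean accuracy over the standard splits, as described above.
        return mean(evaluate_split(dataset, s) for s in range(1, n_splits + 1))

    # e.g. average_accuracy("UCF-101") and average_accuracy("HMDB-51")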

Conclusions

We proposed a spatio-temporal deformable 3D ConvNet module with an attention mechanism to capture the complex action variations. The attention mechanism exploits both long-range temporal dependencies across multiple frames and long-distance spatial dependencies inside each frame, while, given the global attention information, the deformable 3D module can further capture the temporal and spatial variations via flexible convolution filter offsets. Our deformable module exhibits fast computation

Acknowledgment

This work was supported by National Natural Science Foundation of China (61690202, 61872021), Fundamental Research Funds for Central Universities (YWF-19-BJ-J-271), Beijing Municipal Science and Technology Commission (Z171100000117022), and State Key Lab of Software Development Environment (SKLSDE-2018ZX-04).

References (61)

  • H. Wang et al.

    Action recognition with improved trajectories

    Proceedings of the IEEE International Conference on Computer Vision

    (2014)
  • K. Simonyan et al.

    Two-stream convolutional networks for action recognition in videos

    Proceedings of the Advances in Neural Information Processing Systems

    (2014)
  • Y.H. Ng et al.

    Beyond short snippets: deep networks for video classification

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2015)
  • H. Zhu et al.

    Tornado: a spatio-temporal convolutional regression network for video action proposal

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2017)
  • S. Ji et al.

    3D convolutional neural networks for human action recognition

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2013)
  • D. Tran et al.

    Learning spatiotemporal features with 3D convolutional networks

    Proceedings of the IEEE International Conference on Computer Vision

    (2016)
  • Z. Qiu et al.

    Learning spatio-temporal representation with pseudo-3D residual networks

    Proceedings of the IEEE International Conference on Computer Vision

    (2017)
  • J. Carreira et al.

    Quo vadis, action recognition? a new model and the kinetics dataset

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2017)
  • Y. Zhou et al.

    MiCT: Mixed 3D/2D convolutional tube for human action recognition

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2018)
  • S. Ren et al.

    Faster R-CNN: Towards real-time object detection with region proposal networks

    Proceedings of the Advances in Neural Information Processing Systems

    (2015)
  • M. Jaderberg et al.

    Spatial transformer networks

    Proceedings of the Advances in Neural Information Processing Systems

    (2015)
  • Y. Jeon et al.

    Active convolution: Learning the shape of convolution for image classification

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2017)
  • J. Dai et al.

    Deformable convolutional networks

    CoRR

    (2017)
  • C. Deng et al.

    Active transfer learning network: a unified deep joint spectral–spatial feature learning model for hyperspectral image classification

    IEEE Trans. Geosci. Remote Sens.

    (2019)
  • W. Wang et al.

    Semi-supervised video object segmentation with super-trajectories

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2019)
  • J. Song et al.

    Self-supervised video hashing with hierarchical binary auto-encoder

    IEEE Trans. Image Process.

    (2018)
  • W. Wang et al.

    Saliency-aware video object segmentation

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2018)
  • W. Wang et al.

    Revisiting video saliency: a large-scale benchmark and a new model

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2018)
  • J. Song et al.

    From deterministic to generative: Multimodal stochastic RNNs for video captioning

    IEEE Trans. Neural Netw. Learn. Syst.

    (2018)
  • K. Xia et al.

    Temporal binary coding for large-scale video search

    Proceedings of the ACM Conference on Multimedia

    (2017)