Pattern Recognition, Volume 98, February 2020, 107037

Spatio-temporal deformable 3D ConvNets with attention for action recognition

https://doi.org/10.1016/j.patcog.2019.107037

Highlights

  • We are the first to propose spatio-temporal deformable 3D convolutions with an attention mechanism (STDA for short).

  • The proposed module serves as a generic module for many 3D CNNs; in practice, it only needs to be appended at the later convolution layers without incurring much additional computational cost.

  • Our attention mechanism can exploit both long-range temporal dependencies across multiple frames and long-distance spatial dependencies inside each frame, and thus helps extract discriminative global information at both the inter-frame and intra-frame levels.

  • Experiments validate the superior performance and efficiency of the proposed approach.

Abstract

The irregularity of human actions poses great challenges in video action recognition. Recently, 3D ConvNet methods have shown promising performance in modelling the motion and appearance information. However, the fixed geometric structure of 3D convolution filters largely limits the learning capacity for video action recognition. To address this problem, this paper proposes a spatio-temporal deformable ConvNet module with an attention mechanism, which takes into consideration the mutual correlations in both temporal and spatial domains, to effectively capture the long-range and long-distance dependencies in the video actions. Our attention-based deformable module, as a generic module for 3D ConvNets, can adaptively learn more accurate spatio-temporal offsets to model the action irregularity. Experiments on two popular datasets (UCF-101 and HMDB-51) demonstrate that our module significantly outperforms the state-of-the-art methods.

Introduction

Video action recognition [1], [2], [3], [4], [5] has been widely investigated in the computer vision community. Modelling the temporal and spatial variations is one of its most essential yet challenging issues. This is mainly because video actions usually contain complex spatio-temporal correlations, including both long-range dependencies among the sequential frames and long-distance dependencies within the spatial field of each frame. To address the problem, traditional solutions such as the improved Dense Trajectories (iDT) [6] rely on hand-crafted features (e.g., the optical flows) to obtain motion information. Recent deep end-to-end solutions like two-stream ConvNets [7] learn the discriminative features from the optical flow and the appearance via two separate 2D ConvNets. Instead of optical flow features, many studies adopt recurrent neural networks (RNNs), such as LSTM [8] and ConvLSTM [9], to model the video as an ordered sequence of frames.

Despite the good performance of optical-flow-based methods, extracting the optical flow is usually computationally intensive, while RNN-based methods normally capture only a coarse temporal structure among frames. To avoid the expensive computation while capturing the spatio-temporal information, [10] first introduced 3D ConvNets to model the motion and appearance information simultaneously via 3D convolution filters, leading to a number of 3D ConvNet methods, such as C3D [11], P3D ResNet [12], two-stream I3D [13], and the mixed 3D/2D convolutional tube [14].

Unfortunately, the fixed geometric structures of 3D convolution filters in both the receptive and sequential fields largely limit the learning capacity of 3D ConvNets for video action recognition. Intuitively, different parts of the body may move in various directions along the temporal and spatial dimensions for the same human action, which means the regions that 3D convolutions need to cover tend to be irregular in practice. For instance, in the spatial dimension, the fixed receptive field prevents the high-level convolution layers from encoding the semantics over the spatial locations, especially for non-rigid objects [15].

To deal with the irregularity, several techniques have been proposed for 2D ConvNets, such as spatial transformer networks (STN) [16], active convolution [17] and deformable convolution [18]. Of all these techniques, the deformable convolution is the most successful solution as it can implicitly model large, unknown transformations. However, it can hardly obtain the long-range dependencies beyond the local receptive fields, even with the deformable geometric convolution structures. Besides, when the 2D spatial dimension is coupled with the temporal dimension, it becomes more difficult for 3D ConvNets to eliminate the negative effects of the regular 3D cubic geometric structure. Adaptively capturing the temporal and spatial variations for action recognition thus remains an open problem.

To capture the complex action variations, we propose a Spatio-Temporal Deformable 3D convolution module with an Attention mechanism (STDA for short). Our attention mechanism can exploit both long-range temporal dependencies across multiple frames and long-distance spatial dependencies within each frame, thus enabling the extraction of discriminative global information at both the inter-frame and intra-frame levels. Building on this, different from the traditional convolution restricted to a local regular receptive field [19], our spatio-temporal deformable 3D ConvNet can further capture the temporal and spatial irregularity by learning varying convolution filter offsets guided by the attention information.
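To make the idea concrete, the following is a minimal PyTorch sketch (not the authors' implementation) of an attention-guided deformable 3D block: a non-local style self-attention block supplies global spatio-temporal context, a small convolution predicts per-location offsets from the attended features, and the feature map is warped by trilinear sampling before a standard 3D convolution. For brevity it predicts a single (x, y, t) offset per output location rather than one per kernel tap, so it only approximates the full deformable formulation; all layer sizes are illustrative.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AttentionGuidedDeformable3d(nn.Module):
        """Simplified sketch of an attention-guided deformable 3D convolution."""
        def __init__(self, channels):
            super().__init__()
            inter = max(channels // 2, 1)
            # 1x1x1 embeddings for the non-local (self-attention) block.
            self.theta = nn.Conv3d(channels, inter, kernel_size=1)
            self.phi = nn.Conv3d(channels, inter, kernel_size=1)
            self.g = nn.Conv3d(channels, inter, kernel_size=1)
            self.out = nn.Conv3d(inter, channels, kernel_size=1)
            # Predicts one normalized (x, y, t) offset per location; zero-initialized
            # so the block starts out as an ordinary (undeformed) 3D convolution.
            self.offset = nn.Conv3d(channels, 3, kernel_size=3, padding=1)
            nn.init.zeros_(self.offset.weight)
            nn.init.zeros_(self.offset.bias)
            self.conv = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

        def forward(self, x):                                  # x: (N, C, L, H, W)
            n, c, l, h, w = x.shape
            # Attention: every spatio-temporal position attends to every other one.
            q = self.theta(x).flatten(2).transpose(1, 2)       # (N, LHW, C')
            k = self.phi(x).flatten(2)                         # (N, C', LHW)
            v = self.g(x).flatten(2).transpose(1, 2)           # (N, LHW, C')
            attn = torch.softmax(q @ k / q.shape[-1] ** 0.5, dim=-1)
            ctx = (attn @ v).transpose(1, 2).reshape(n, -1, l, h, w)
            feat = x + self.out(ctx)                           # add global context
            # Deformation: warp the features by the predicted offsets, then convolve.
            off = self.offset(feat).permute(0, 2, 3, 4, 1)     # (N, L, H, W, 3)
            t = torch.linspace(-1, 1, l, device=x.device)
            ys = torch.linspace(-1, 1, h, device=x.device)
            xs = torch.linspace(-1, 1, w, device=x.device)
            gt, gy, gx = torch.meshgrid(t, ys, xs, indexing="ij")
            base = torch.stack((gx, gy, gt), dim=-1)           # grid_sample expects (x, y, t)
            grid = base.unsqueeze(0).expand(n, -1, -1, -1, -1) + off
            warped = F.grid_sample(feat, grid, align_corners=True)
            return self.conv(warped)                           # same shape as the input

Appending such a block after a late 3D convolution layer, as the highlights suggest, keeps the extra cost moderate because the attention and offset branches operate on relatively small feature maps.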

To the best of our knowledge, we are the first to propose an efficient spatio-temporal deformable module equipped with an attention submodule in 3D ConvNets for action recognition. Without much extra computational effort, the proposed module can easily replace the standard 3D convolution filter in popular 3D ConvNets. Extensive evaluations on two diverse benchmarks (UCF-101 and HMDB-51) show that, compared with the state-of-the-art methods, our STDA module can significantly boost the overall recognition performance.

The remainder of the paper is structured as follows. Section 2 presents the related work, including video action recognition, various types of ConvNets and attention models. Section 3 elaborates the proposed module and the key techniques. In Section 4 we extensively evaluate our module over different datasets and 3D network architectures. Section 5 concludes the paper.

Section snippets

Video action recognition

Video analysis is an extensively studied topic in the literature [20], [21], [22], [23], [24], [25], especially for the action recognition task. Typical traditional methods rely on hand-crafted features such as 3D-SIFT [26] and motion boundary histograms (MBH) [27]; for example, [28] proposed space-time interest points by extending the notion of spatial interest points into the spatio-temporal domain, and showed that the resulting features provide a compact representation of video data.

The methodology

The notations used in this paper are as follows. For the video clips, suppose we have a sequential input feature map $X \in \mathbb{R}^{c \times l \times h \times w}$, where c denotes the number of channels, l is the number of frames in each input sliding temporal window, and h and w denote the height and the width of the feature map, respectively. Similarly, we define $Y \in \mathbb{R}^{c \times l \times h \times w}$ as the output of X after passing through our STDA module. In practice, the 3D convolution filter shares the identical (e.g., 3) size in the three dimensions.
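As a quick sanity check of this notation, the short snippet below (with illustrative shape values, not taken from the paper) confirms that a 3 × 3 × 3 convolution with stride 1 and padding 1 maps X to an output Y of the same c × l × h × w size.

    import torch
    import torch.nn as nn

    c, l, h, w = 64, 16, 28, 28            # illustrative values only
    x = torch.randn(1, c, l, h, w)         # a batch dimension is added for PyTorch
    conv3d = nn.Conv3d(c, c, kernel_size=3, stride=1, padding=1)
    y = conv3d(x)
    assert y.shape == x.shape              # Y has the same c x l x h x w size as X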

Evaluation protocols

To evaluate our spatio-temporal deformable 3D ConvNets with attention (STDA), we follow the prior studies [30], [54], [55], [56] and choose two popular yet challenging video datasets: UCF-101 and HMDB-51. Fig. 3 shows some typical action examples from the UCF-101 dataset (top row) and the HMDB-51 dataset (bottom row). For both datasets, we use the standard training/testing splits and report the average accuracy.
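A minimal sketch of this reporting protocol is given below; evaluate_split is a hypothetical placeholder for the actual train/test pipeline, and the three standard splits of each dataset are assumed.

    from statistics import mean

    def evaluate_split(dataset: str, split: int) -> float:
        """Hypothetical placeholder: train on the split's training list and
        return the top-1 test accuracy for that split."""
        raise NotImplementedError

    def average_accuracy(dataset: str, n_splits: int = 3) -> float:
        # Report the mean accuracy over the standard splits, as described above.
        return mean(evaluate_split(dataset, s) for s in range(1, n_splits + 1))

    # e.g. average_accuracy("UCF-101") and average_accuracy("HMDB-51")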

Conclusions

We proposed a spatio-temporal deformable 3D ConvNet module with an attention mechanism to capture the complex action variations. The attention mechanism exploits both long-range temporal dependencies across multiple frames and long-distance spatial dependencies inside each frame, while, given the global attention information, the deformable 3D module can further capture the temporal and spatial variations via flexible convolution filter offsets. Our deformable module exhibits fast computation

Acknowledgment

This work was supported by National Natural Science Foundation of China (61690202, 61872021), Fundamental Research Funds for Central Universities (YWF-19-BJ-J-271), Beijing Municipal Science and Technology Commission (Z171100000117022), and State Key Lab of Software Development Environment (SKLSDE-2018ZX-04).

References (61)

  • H. Wang et al.

    Action recognition with improved trajectories

    Proceedings of the IEEE International Conference on Computer Vision

    (2014)
  • K. Simonyan et al.

    Two-stream convolutional networks for action recognition in videos

    Proceedings of the Advances in Neural Information Processing Systems

    (2014)
  • Y.H. Ng et al.

    Beyond short snippets: deep networks for video classification

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2015)
  • H. Zhu et al.

    Tornado: a spatio-temporal convolutional regression network for video action proposal

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2017)
  • S. Ji et al.

    3D convolutional neural networks for human action recognition

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2013)
  • D. Tran et al.

    Learning spatiotemporal features with 3D convolutional networks

    Proceedings of the IEEE International Conference on Computer Vision

    (2016)
  • Z. Qiu et al.

    Learning spatio-temporal representation with pseudo-3D residual networks

    Proceedings of the IEEE International Conference on Computer Vision

    (2017)
  • J. Carreira et al.

    Quo vadis, action recognition? a new model and the kinetics dataset

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2017)
  • Y. Zhou et al.

    MiCT: Mixed 3D/2D convolutional tube for human action recognition

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2018)
  • S. Ren et al.

    Faster R-CNN: Towards real-time object detection with region proposal networks

    Proceedings of the Advances in Neural Information Processing Systems

    (2015)
  • M. Jaderberg et al.

    Spatial transformer networks

    Proceedings of the Advances in Neural Information Processing Systems

    (2015)
  • Y. Jeon et al.

    Active convolution: Learning the shape of convolution for image classification

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2017)
  • J. Dai et al.

    Deformable convolutional networks

    CoRR

    (2017)
  • C. Deng et al.

    Active transfer learning network: a unified deep joint spectral–spatial feature learning model for hyperspectral image classification

    IEEE Trans. Geosci. Remote Sens.

    (2019)
  • W. Wang et al.

    Semi-supervised video object segmentation with super-trajectories

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2019)
  • J. Song et al.

    Self-supervised video hashing with hierarchical binary auto-encoder

    IEEE Trans. Image Process.

    (2018)
  • W. Wang et al.

    Saliency-aware video object segmentation

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2018)
  • W. Wang et al.

    Revisiting video saliency: a large-scale benchmark and a new model

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2018)
  • J. Song et al.

    From deterministic to generative: Multimodal stochastic RNNs for video captioning

    IEEE Trans. Neural Netw. Learn. Syst.

    (2018)
  • K. Xia et al.

    Temporal binary coding for large-scale video search

    Proceedings of the ACM Conference on Multimedia

    (2017)