Multi-cue based four-stream 3D ResNets for video-based action recognition
Introduction
Action recognition is a research hotspot in computer vision, with wide applications such as intelligent surveillance, video search and smart healthcare [10]. Video-based action recognition refers to classifying the action performed in a video, i.e. action-based video classification. An image contains two-dimensional spatial information, whereas a video contains three-dimensional spatiotemporal information; that is, a video carries extra temporal information compared with an image [31]. Thus, video-based action recognition is more challenging than image classification, since learning a three-dimensional spatiotemporal feature representation for a video is more complicated than learning a two-dimensional spatial feature representation for an image.
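To make the dimensionality difference concrete, here is a minimal PyTorch sketch (the tensor sizes below are illustrative assumptions, not values from this paper):

```python
import torch

# A single RGB image: (channels, height, width) -- purely spatial.
image = torch.randn(3, 224, 224)

# A short RGB clip: (channels, frames, height, width) -- the extra
# "frames" axis carries the temporal information an image lacks.
clip = torch.randn(3, 16, 224, 224)

print(image.shape)  # torch.Size([3, 224, 224])
print(clip.shape)   # torch.Size([3, 16, 224, 224])
```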
Deep convolutional neural networks (CNNs) have been widely used in image processing, computer vision, natural language processing and other fields [25], [13], [24], [34], [43], e.g. image recognition [45], object detection and face recognition [46]. The main reason for this success is that deep CNNs can learn deep hierarchical visual feature representations layer by layer, unlike conventional shallow methods [9], [39], [44]. For action recognition, many deep-learning-based methods have been developed.
Among various deep-learning-based action recognition models [48], [47], the two-stream CNN model [26] and the 3D CNN model [33] are the two most widely used. The two-stream CNN model uses two separate streams to extract spatial and temporal features from videos; its advantage is that it can explicitly capture the appearance cue and the motion cue separately. The 3D CNN model performs convolutions along the spatial and temporal dimensions simultaneously, making it a natural choice for capturing the spatiotemporal cue. However, previous two-stream CNN models and 3D CNN models have ignored two important sources of information: (i) the salient information in videos, which is important for identifying actions, and (ii) the audio information in videos, which is important for identifying certain actions, such as playing instruments.
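The contrast between the two families can be illustrated with a minimal PyTorch sketch (the layer widths and input sizes are our illustrative assumptions, not the models' actual configurations):

```python
import torch
import torch.nn as nn

# 2D convolution (two-stream style): slides over height and width only,
# so each output frame sees a single input frame's spatial content.
conv2d = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
frame = torch.randn(1, 3, 112, 112)             # (batch, C, H, W)
print(conv2d(frame).shape)                      # torch.Size([1, 64, 112, 112])

# 3D convolution (3D CNN style): slides over time, height and width
# simultaneously, mixing information across neighbouring frames.
conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
clip = torch.randn(1, 3, 16, 112, 112)          # (batch, C, T, H, W)
print(conv3d(clip).shape)                       # torch.Size([1, 64, 16, 112, 112])
```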
Recently, Ming et al. [47] proposed a motion-saliency-based two-stream model, which utilizes the salient motion cue to improve recognition accuracy. Yan et al. [30] proposed adding an audio stream to a two-stream model, utilizing the audio cue to enhance the accuracy of action recognition. However, neither model considers the salient cue and the audio cue simultaneously.
In addition, for 3D CNN models, Carreira and Zisserman [5] proposed the two-stream Inflated 3D CNN (I3D) model for action recognition, which replaces the 2D CNNs in a two-stream CNN model with 3D CNNs. The I3D model has proven to be far better than the two-stream CNN model. Hara et al. [11] proposed a deep 3D ResNet for action recognition. 3D ResNet differs from I3D in two respects: (i) the input: I3D takes two streams (RGB video frames and optical flow frames) as input, while 3D ResNet takes a single stream (RGB video frames only); (ii) the backbone: I3D expands the 2D convolutions of the Inception network [14] into 3D convolutions, while 3D ResNet expands the 2D convolutions of ResNet [12] into 3D convolutions.
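Expanding a 2D convolution into a 3D one can be sketched as follows. The weight-copying scheme (repeat the 2D kernel along time and rescale by the temporal extent) is I3D's bootstrapping trick for reusing 2D-pretrained weights; the helper name and defaults here are ours:

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_dim: int = 3) -> nn.Conv3d:
    """Expand a 2D convolution into a 3D one by repeating its kernel
    along the temporal axis and rescaling, so the inflated filter
    initially produces the same response on a static ("boring") video."""
    conv3d = nn.Conv3d(
        conv2d.in_channels,
        conv2d.out_channels,
        kernel_size=(time_dim, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(time_dim // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # (out, in, kH, kW) -> (out, in, T, kH, kW), divided by T.
        weight3d = conv2d.weight.unsqueeze(2).repeat(1, 1, time_dim, 1, 1)
        conv3d.weight.copy_(weight3d / time_dim)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d
```

Applying such an expansion to every convolution of the 2D backbone yields the corresponding 3D architecture.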
Based on the above considerations, we propose a novel Multi-cue based Four-stream 3D ResNets model (MF3D for short) for video-based action recognition. The MF3D model is composed of four streams: a video saliency stream, an appearance stream, a motion stream and an audio stream, which capture four cues: the video saliency cue, the appearance cue, the motion cue and the audio cue, respectively. In addition, three different types of connections between the streams are injected, which transfer different cues between streams to obtain more effective spatiotemporal features.
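A minimal structural sketch of such a four-stream design is given below. This is our illustration only, not the exact MF3D architecture: the placeholder trunks, the input channel counts and the single multiplicative connection stand in for the 3D ResNet streams and the three connection types described later.

```python
import torch
import torch.nn as nn

class FourStreamSketch(nn.Module):
    """Structural sketch of a four-stream network: each stream is a
    placeholder 3D-CNN trunk, and a simple multiplicative connection
    passes cues from one stream to another before late fusion."""

    def __init__(self, num_classes: int = 101):
        super().__init__()
        def trunk(in_ch):  # stand-in for a 3D ResNet stream
            return nn.Sequential(
                nn.Conv3d(in_ch, 64, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            )
        self.saliency = trunk(1)    # video saliency maps
        self.appearance = trunk(3)  # RGB frames
        self.motion = trunk(2)      # optical flow (x/y components)
        self.audio = trunk(1)       # e.g. a spectrogram volume (assumption)
        self.fc = nn.Linear(64 * 4, num_classes)

    def forward(self, sal, rgb, flow, aud):
        f_sal = self.saliency(sal)
        f_app = self.appearance(rgb)
        f_mot = self.motion(flow)
        f_aud = self.audio(aud)
        # One possible interactive connection: let motion features
        # modulate the appearance features multiplicatively.
        f_app = f_app * f_mot
        return self.fc(torch.cat([f_sal, f_app, f_mot, f_aud], dim=1))
```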
The key contributions can be summarized as follows:
- Four important cues (i.e. the appearance cue, the video saliency cue, the motion cue and the audio cue) are proposed and integrated into a single model for video-based action recognition.
- We propose a video saliency stream guided by motion information for action recognition. Further, interactive connections between the different streams are proposed (see the sketch following this list).
- 3D ResNet is adopted as the implementation network of all four streams (i.e. the video saliency stream, the appearance stream, the motion stream and the audio stream) in our MF3D model.
The rest of this paper is organized as follows: we introduce the related work of action recognition in Section 2. The MF3D model is presented in Section 3. Experiments are described in Section 4. Finally, conclusions are drawn in Section 5.
Section snippets
Related work
Action recognition has received a lot of attention in the last few years. In this section, we review recent studies related to our approach.
Our MF3D model
The proposed MF3D model for video-based action recognition is first introduced in Section 3.1. Then, the detailed structure of the residual block with interactive connections in our MF3D model is presented in Section 3.2. Finally, the interactive forward-propagation and backpropagation process of the proposed MF3D model is presented in Section 3.3.
Experimental environments
We conduct all the experiments on a deep learning workstation with an Intel CPU and 252 GB of memory. It has two NVIDIA GPUs, i.e. two Quadro RTX 6000 GPUs, each with 24 GB of GPU memory, giving 48 GB of total GPU memory for the experiments. The operating system is Ubuntu 20.04. We use Python 3 to implement the proposed MF3D model, and the PyTorch 1.6 library to implement and compute the deep neural networks.
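A quick sanity check of such an environment before training might look as follows (the expected values in the comments simply restate the setup described above):

```python
import torch

print(torch.__version__)          # expected: 1.6.x per the setup above
print(torch.cuda.device_count())  # expected: 2 (Quadro RTX 6000)
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(props.name, round(props.total_memory / 1024**3), "GB")  # ~24 GB each
```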
Conclusion
We proposed a multi-cue based four-stream 3D ResNets model (MF3D) for video-based action recognition. Video saliency information and audio information have been verified to be effective for improving action recognition. Further, our experimental results show that audio information is more effective than video saliency information for action recognition. The complementary motion information for the audio stream and the complementary video saliency information have also been proven to be effective.
CRediT authorship contribution statement
Lei Wang: Conceptualization, Methodology, Software. Xiaoguang Yuan: Software, Data curation, Formal analysis. Ming Zong: Software, Writing - original draft. Yujun Ma: Visualization, Investigation. Wanting Ji: Software, Validation. Mingzhe Liu: Writing - review & editing. Ruili Wang: Supervision.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment
This work was supported in part by the National Natural Science Foundation of China under Grant U19A2086, the China Scholarship Council (CSC), and the New Zealand China Doctoral Research Scholarship.
References (48)
- et al., Background–foreground interaction for moving object detection in dynamic scenes, Inf. Sci. (2019)
- et al., Feature selection for least squares projection twin support vector machine, Neurocomputing (2014)
- et al., Human action recognition via multi-task learning base on spatial-temporal feature, Inf. Sci. (2015)
- et al., Deep visual tracking: Review and experimental comparison, Pattern Recogn. (2018)
- et al., Stochastic configuration networks ensemble with heterogeneous features for large-scale data analytics, Inf. Sci. (2017)
- et al., Three-stream CNNs for action recognition, Pattern Recogn. Lett. (2017)
- et al., Motion saliency based multi-stream multiplier ResNets for action recognition, Image Vis. Comput. (2021)
- et al., Saliency guided local and global descriptors for effective action recognition, Computational Visual Media (2016)
- Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra...
- et al., Look, listen and learn
- Quo vadis, action recognition? A new model and the Kinetics dataset
- Spatiotemporal multiplier networks for video action recognition
- Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?
- Deep residual learning for image recognition
- Improving entity linking through semantic reinforced entity embeddings
- Large-scale video classification with convolutional neural networks
- HMDB: a large video database for human motion recognition
- Gradient-based learning applied to document recognition, Proc. IEEE
- Motion guided attention for video salient object detection
Cited by (23)

- k-NN attention-based video vision transformer for action recognition, Neurocomputing (2024)
- Multi-stream Global–Local Motion Fusion Network for skeleton-based action recognition, Applied Soft Computing (2023)
- Enhancing motion visual cues for self-supervised video representation learning, Engineering Applications of Artificial Intelligence (2023)
- APSL: Action-positive separation learning for unsupervised temporal action localization, Information Sciences (2023)
- Body part relation reasoning network for human activity understanding, Information Sciences (2023). Citation excerpt: "Human activity understanding is one of the research hotspots in the field of computer vision [4,13,19,23,26,37,38]."