Information Sciences, Volume 575, October 2021, Pages 654-665

Multi-cue based four-stream 3D ResNets for video-based action recognition

https://doi.org/10.1016/j.ins.2021.07.079

Abstract

Action recognition is an important computer vision task with many applications. This paper proposes a Multi-cue based Four-stream 3D ResNets (MF3D) model for action recognition. The proposed MF3D model contains four streams: a video saliency stream, an appearance stream, a motion stream and an audio stream, which capture four cues (i.e. the video saliency cue, the appearance cue, the motion cue and the audio cue), respectively. In addition, three types of connections are injected between the streams, which transfer cues across streams to obtain more effective spatiotemporal features. Experiments conducted on the Kinetics and Kinetics-Sounds datasets verify that our MF3D model is effective and outperforms existing models.

Introduction

Action recognition is a research hotspot in computer vision with wide applications such as intelligent surveillance, video search and smart healthcare [10]. Video-based action recognition refers to classifying the action performed in a video, i.e. action-based video classification. An image usually contains two-dimensional spatial information, while a video contains three-dimensional spatiotemporal information; in other words, a video carries extra temporal information compared with an image [31]. Thus, video-based action recognition is more challenging than image classification, since spatiotemporal feature representation for a video is more complicated than spatial feature representation for an image.

Deep convolutional neural networks (CNNs) have been widely used in image processing, computer vision, natural language processing and other fields [25], [13], [24], [34], [43], e.g. image recognition [45], object detection and face recognition [46]. The main reason for this success is that deep CNNs can learn deep hierarchical visual feature representations layer by layer, unlike conventional shallow methods [9], [39], [44]. For action recognition, many deep-learning-based methods have been developed.

Among various deep-learning-based action recognition models [48], [47], the two-stream CNN model [26] and the 3D CNN model [33] are two widely used deep learning models. The two-stream CNN model uses two separate streams to extract spatial features and temporal features from videos, so it can explicitly capture the appearance cue and the motion cue. The 3D CNN model performs convolutions along the spatial and temporal dimensions simultaneously, so it is a natural choice for capturing the spatiotemporal cue. However, both previous two-stream CNN models and 3D CNN models ignore two important points: (i) the salient information in videos, which is important for identifying actions; and (ii) the audio information in videos, which is important for identifying some special actions, such as playing instruments.
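To illustrate this difference, the following minimal PyTorch sketch contrasts a 2D convolution applied frame by frame with a 3D convolution applied across frames; the tensor sizes and channel counts are illustrative assumptions for this example, not values from the paper.

```python
# Minimal sketch: a 2D convolution sees one frame at a time, whereas a
# 3D convolution also slides over the temporal (frame) dimension.
import torch
import torch.nn as nn

video = torch.randn(1, 3, 16, 112, 112)   # (batch, channels, frames, height, width)

conv2d = nn.Conv2d(3, 64, kernel_size=3, padding=1)   # spatial only
conv3d = nn.Conv3d(3, 64, kernel_size=3, padding=1)   # spatial + temporal

# A 2D convolution must treat each frame independently (fold frames into the batch).
frames = video.permute(0, 2, 1, 3, 4).reshape(-1, 3, 112, 112)
spatial_features = conv2d(frames)                      # (16, 64, 112, 112)

# A 3D convolution mixes neighbouring frames, capturing the spatiotemporal cue.
spatiotemporal_features = conv3d(video)                # (1, 64, 16, 112, 112)
print(spatial_features.shape, spatiotemporal_features.shape)
```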

Recently, Ming et al. [47] proposed a motion-saliency-based two-stream model, which utilizes the salient motion cue to improve recognition accuracy. Yan et al. [30] proposed adding an audio stream to a two-stream model, which utilizes the audio cue to enhance the accuracy of action recognition. However, neither model considers the saliency cue and the audio cue simultaneously.

In addition, for 3D CNN models, Carreira and Zisserman [5] proposed a two-stream inflated 3D CNN (I3D) model for action recognition, which replaces the 2D CNNs in a two-stream CNN model with 3D CNNs. The I3D model has been shown to perform far better than the two-stream CNN model. Hara et al. [11] proposed a deep 3D ResNet for action recognition. The differences between 3D ResNet and I3D are: (i) the input: I3D has two streams that take RGB video frames and optical flow frames as input, while 3D ResNet has a single stream that takes only RGB video frames as input; (ii) the backbone network: I3D adopts the Inception network [14], while 3D ResNet adopts ResNet [12], in both cases expanding the 2D convolutions into 3D convolutions.
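The inflation idea shared by I3D and 3D ResNet can be sketched as follows. This is a minimal illustration of 2D-to-3D kernel inflation under our own assumptions (e.g. a temporal kernel depth of 3 and averaging over time), not the exact initialisation used by either model.

```python
# Sketch of 2D-to-3D kernel inflation: a pretrained 2D kernel is repeated
# along a new temporal axis and rescaled so that the response on a static
# video roughly matches the original 2D network.
import torch
import torch.nn as nn

def inflate_conv(conv2d: nn.Conv2d, time_dim: int = 3) -> nn.Conv3d:
    kh, kw = conv2d.kernel_size
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(time_dim, kh, kw),
        stride=(1,) + conv2d.stride,
        padding=(time_dim // 2,) + conv2d.padding,
        bias=conv2d.bias is not None,
    )
    # Repeat the 2D weight time_dim times along the temporal axis and rescale.
    weight3d = conv2d.weight.data.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim
    conv3d.weight.data.copy_(weight3d)
    if conv2d.bias is not None:
        conv3d.bias.data.copy_(conv2d.bias.data)
    return conv3d

# Example: inflate a ResNet-style first convolution and apply it to a clip.
conv2d = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
conv3d = inflate_conv(conv2d)
out = conv3d(torch.randn(1, 3, 16, 112, 112))
print(out.shape)   # torch.Size([1, 64, 16, 56, 56])
```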

Based on the above considerations, we propose a novel Multi-cue based Four-stream 3D ResNets (named MF3D model for short) for video-based action recognition. The MF3D model is composed of four streams: a video saliency stream, an appearance stream, a motion stream and an audio stream, which capture four cues (i.e. the appearance cue, the motion cue, the video saliency cue and the audio cue), respectively. In addition, three types of connections are injected between the streams, which transfer cues across streams to obtain more effective spatiotemporal features.
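Since the detailed architecture is given in Section 3, the following is only a hedged sketch of how a four-stream classifier over the four cues could be assembled in PyTorch. The tiny placeholder backbone, the 512-d feature size and the concatenation-based fusion are illustrative assumptions, not the MF3D configuration; the cross-stream connections are sketched separately in Section 3.

```python
# Hedged sketch of a four-stream classifier (PyTorch); not the paper's exact model.
import torch
import torch.nn as nn

def tiny_backbone():
    # Stand-in for a 3D ResNet that pools its feature map to a 512-d vector.
    return nn.Sequential(
        nn.Conv3d(3, 512, kernel_size=3, padding=1),
        nn.AdaptiveAvgPool3d(1),
        nn.Flatten(),
    )

class FourStreamNet(nn.Module):
    def __init__(self, backbone_fn, num_classes: int):
        super().__init__()
        # One 3D backbone per cue: appearance (RGB frames), motion (optical
        # flow), video saliency (saliency maps) and audio (spectrograms).
        self.appearance = backbone_fn()
        self.motion = backbone_fn()
        self.saliency = backbone_fn()
        self.audio = backbone_fn()
        self.classifier = nn.Linear(4 * 512, num_classes)

    def forward(self, rgb, flow, sal, aud):
        feats = torch.cat(
            [self.appearance(rgb), self.motion(flow),
             self.saliency(sal), self.audio(aud)], dim=1)
        return self.classifier(feats)

# For simplicity, all four inputs share one shape here; in practice the audio
# stream would consume spectrogram tensors with a different shape.
model = FourStreamNet(tiny_backbone, num_classes=400)
x = torch.randn(2, 3, 16, 56, 56)
logits = model(x, x, x, x)
print(logits.shape)   # torch.Size([2, 400])
```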

The key contributions can be summarized as follows:

  • Four important cues (i.e. the appearance cue, the video saliency cue, the motion cue and audio cue) are proposed and integrated into a model for video-based action recognition.

  • We propose a video saliency stream guided by motion information for action recognition. Further, the interactive connections between different streams are proposed.

  • 3D ResNet is adopted as the implementation network of the four streams (i.e. the video saliency stream, the appearance stream, the motion stream and the audio stream) in our MF3D model.

The rest of this paper is organized as follows: we introduce the related work of action recognition in Section 2. The MF3D model is presented in Section 3. Experiments are described in Section 4. Finally, conclusions are drawn in Section 5.

Section snippets

Related work

Action recognition has received a lot of attention in the last few years. In this section, recent studies related to our approach are reviewed.

Our MF3D model

The proposed MF3D model for video-based action recognition is first introduced in Section 3.1. Then, the detailed structure of the residual block with interactive connections in our MF3D model is presented in Section 3.2. Finally, the interactive forward propagation and backpropagation processes of the proposed MF3D model are presented in Section 3.3.
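As a preview of the idea, the sketch below shows one plausible form of a 3D residual block that accepts an interactive connection from another stream: the guest stream's feature map is projected by a 1x1x1 convolution and added to the block output. This is our own illustrative assumption; the exact wiring of the interactive connections in the MF3D model is specified in Section 3.2 and may differ.

```python
# Hedged sketch of a 3D residual block with an interactive cross-stream connection.
import torch
import torch.nn as nn

class InteractiveResBlock3D(nn.Module):
    def __init__(self, channels: int, guest_channels: int):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)
        # Projection for the feature map arriving from the other stream.
        self.cross = nn.Conv3d(guest_channels, channels, kernel_size=1, bias=False)

    def forward(self, x, guest=None):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + x                      # standard residual connection
        if guest is not None:
            out = out + self.cross(guest)  # interactive cross-stream connection
        return self.relu(out)

# Example: a motion-stream feature map guiding the video saliency stream.
block = InteractiveResBlock3D(channels=64, guest_channels=64)
sal_feat = torch.randn(1, 64, 8, 28, 28)
motion_feat = torch.randn(1, 64, 8, 28, 28)
out = block(sal_feat, guest=motion_feat)
print(out.shape)   # torch.Size([1, 64, 8, 28, 28])
```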

Experimental environments

We conduct all the experiments on a deep learning workstation. The CPU is Intel [email protected] GHz and the memory is 252 GB. There are two NVIDIA Quadro RTX 6000 GPUs, each with 24 GB of GPU memory, so the total available GPU memory for experiments is 48 GB. The operating system is Ubuntu 20.04. We use Python 3 to implement the proposed MF3D model, and the PyTorch 1.6 library is used to implement and compute the deep neural networks.

Conclusion

We proposed a Multi-cue based Four-stream 3D ResNets (MF3D) model for video-based action recognition. The video saliency information and the audio information have been verified to be effective for improving action recognition. Further, our experimental results show that the audio information is more effective than the video saliency information for action recognition. The complementary motion information for the audio stream and the complementary video saliency information have also been proven to be effective.

CRediT authorship contribution statement

Lei Wang: Conceptualization, Methodology, Software. Xiaoguang Yuan: Software, Data curation, Formal analysis. Ming Zong: Software, Writing - original draft. Yujun Ma: Visualization, Investigation. Wanting Ji: Software, Validation. Mingzhe Liu: Writing - review & editing. Ruili Wang: Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported in part by the National Natural Science Foundation of China under Grant U19A2086, the China Scholarship Council (CSC), and the New Zealand China Doctoral Research Scholarship.

References (48)

  • Yunlong Bian, Chuang Gan, Xiao Liu, Fu Li, Xiang Long, Yandong Li, Heng Qi, Jie Zhou, Shilei Wen, and Yuanqing Lin....
  • Joao Carreira et al. Quo vadis, action recognition? A new model and the Kinetics dataset.
  • Quan-Qi Chen, Feng Liu, Xue Li, Bao-Di Liu, and Yu-Jin Zhang. Saliency-context two-stream convnets for action...
  • Christoph Feichtenhofer et al. Spatiotemporal multiplier networks for video action recognition.
  • Kensho Hara et al. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?
  • Kaiming He et al. Deep residual learning for image recognition.
  • Feng Hou et al. Improving entity linking through semantic reinforced entity embeddings.
  • Sergey Ioffe, Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate...
  • Yuzhu Ji, Haijun Zhang, Zequn Jie, Lin Ma, Q.M. Jonathan Wu. Casnet: a cross-attention siamese network for video...
  • Andrej Karpathy et al. Large-scale video classification with convolutional neural networks.
  • Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim...
  • Hildegard Kuehne et al. HMDB: a large video database for human motion recognition.
  • Yann LeCun et al. Gradient-based learning applied to document recognition. Proc. IEEE (1998).
  • Haofeng Li et al. Motion guided attention for video salient object detection.
