Spatial-temporal saliency action mask attention network for action recognition

https://doi.org/10.1016/j.jvcir.2020.102846

Highlights

  • The novel two-stream algorithm learns modality-specific features.

  • The key-frame mechanism reduces inter-frame redundancy.

  • The saliency action mask attention mechanism eliminates intra-frame redundancy.

  • The Spatial-Temporal Saliency Action Mask Attention Network (STSAMANet) shows superior performance.

Abstract

Video action recognition based on two-stream networks remains a popular research topic in computer vision. However, most current two-stream-based methods suffer from two redundancy issues: inter-frame redundancy and intra-frame redundancy. To address these problems, a Spatial-Temporal Saliency Action Mask Attention network (STSAMANet) is built for action recognition. First, this paper introduces a key-frame mechanism to eliminate inter-frame redundancy; the mechanism selects, for each video sequence, the key frames that maximize the difference between frames. Then, Mask R-CNN detection is employed to build a saliency attention layer that eliminates intra-frame redundancy by focusing on the salient human body and objects of each action class. We conduct experiments on two public video action datasets, the UCF101 and Penn Action datasets, to verify the effectiveness of our method for action recognition.

Introduction

Action recognition, as an important branch of computer vision, has attracted considerable attention in both theory and practice. Owing to the diversity of real-world environments and the complexity of human action, further research on human action recognition is needed, and an effective video action representation is crucial for this challenging problem. Currently, action recognition is based primarily on RGB, optical flow, skeleton or depth data. Information provided by RGB images is sensitive to illumination [1], [2], [3]. Skeleton data are easily affected by camera angle and body occlusion [4], [5]. Depth data are vulnerable to discontinuities, especially around important body parts [6], [7]. Optical flow can compensate for camera motion and highlight the contours of the moving human body, so it is chosen as the motion modality of the network because of its effectiveness in video recognition [8], [9]. However, for long-term actions or fast movements, the estimated optical flow degrades. Researchers realized that a single modality is not sufficient for action recognition and began to study feature fusion, such as RGB-D [6]. With the advent of deep learning, researchers then explored different network architectures, among which the two-stream network has become dominant [8], [9], [10], [11], [12], [13]. Accordingly, this paper investigates human action recognition based on the two-stream network.

The two-stream architecture exploits complementary information by training a separate convolutional network on RGB frames and on stacks of optical flow. However, it faces the following common problems:

  • (1)

    A fixed-size representation is required when aggregating frame-level features into video-level features. However, most networks extract frames at random, which can yield a large number of redundant frame stacks that are not discriminative enough. Extracting frames in a more informed way is therefore a major problem.

  • (2)

    The network needs to learn features from every frame. When video-level features are fed directly into the network, most networks process them inefficiently and consume a large amount of useless information. Capturing salient cues in each frame is therefore another major problem.

Since these two issues prevent action recognition from reaching its full potential, in this paper we present a Spatial-Temporal Saliency Action Mask Attention (STSAMA) architecture for human action recognition. First, a major issue in action recognition is video representation. In most networks, video frames are selected directly or at random, which easily leads to high computational cost or feature redundancy [14], [15], [16]. We therefore introduce a key-frame mechanism instead of random selection. The mechanism combines static RGB frames with optical flow frames, applies a clustering algorithm, and generates optimal cluster centers; these steps divide the video frames into several clusters and pick out the most discriminative frames. The resulting key frames remain in chronological order and thus preserve the motion sequence of the video; the mechanism also increases inter-frame variability and filters out a large number of similar frames. A minimal sketch of such a selector is given below.
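
The following is a minimal sketch of a cluster-based key-frame selector, assuming flattened RGB pixels concatenated with optical-flow magnitude as the per-frame descriptor and a fixed number k of key frames; both choices are illustrative assumptions, not the authors' exact design.

import numpy as np
from sklearn.cluster import KMeans

def select_key_frames(rgb_frames, flow_frames, k=16):
    # rgb_frames: (T, H, W, 3) uint8; flow_frames: (T, H, W, 2) float32,
    # assumed temporally aligned with the RGB frames.
    # Returns the indices of up to k key frames in chronological order.
    t = rgb_frames.shape[0]
    rgb_feat = rgb_frames.reshape(t, -1).astype(np.float32) / 255.0
    flow_mag = np.linalg.norm(flow_frames, axis=-1).reshape(t, -1)
    desc = np.concatenate([rgb_feat, flow_mag], axis=1)   # joint appearance + motion descriptor

    # Cluster the frames and keep the frame closest to each cluster center.
    km = KMeans(n_clusters=min(k, t), n_init=10, random_state=0).fit(desc)
    key_idx = []
    for c in range(km.n_clusters):
        members = np.where(km.labels_ == c)[0]
        dist = np.linalg.norm(desc[members] - km.cluster_centers_[c], axis=1)
        key_idx.append(int(members[np.argmin(dist)]))     # most representative frame of the cluster

    return sorted(key_idx)                                # sorting preserves the motion order

In practice the descriptor and k would be tuned per dataset; the essential property is that the retained frames are cluster representatives returned in temporal order, which keeps inter-frame differences large while discarding near-duplicate frames.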

In addition, even when key frames are passed into our network, the recognition results for some classes are still poor because of interference from cluttered backgrounds in each frame [17], [18], [19], [20]. Inspired by Mask R-CNN [21], we train a saliency action detection model to obtain saliency mask maps and then build an attention layer. We embed this layer into the network to highlight the effective semantic information, namely specific objects and the human body, so that the network preserves inter-class differences and focuses on the distinctive areas in each frame. Finally, we train two networks, with a Bidirectional LSTM (Bi-LSTM) [22] and C3D [23] respectively, to exploit the unique characteristics of each modality, and we evaluate different fusion methods and training strategies to optimize the weights of our deep learning model. Extensive experiments on the UCF101 and Penn Action datasets show the validity and practicability of our method. A sketch of such an attention layer follows.
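
The sketch below shows one way such a saliency mask attention layer could reweight frame features with a detector mask. The choice of detector (e.g. torchvision's off-the-shelf Mask R-CNN), the feature resolution, and the background attenuation factor are illustrative assumptions, not the authors' exact layer.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SaliencyMaskAttention(nn.Module):
    def __init__(self, background_weight=0.2):
        super().__init__()
        # How strongly non-salient (background) regions are attenuated.
        self.background_weight = background_weight

    def forward(self, feat, mask):
        # feat: (N, C, H, W) CNN feature map of a frame.
        # mask: (N, 1, h, w) binary person/object mask from the detector.
        mask = F.interpolate(mask.float(), size=feat.shape[-2:], mode="nearest")
        # Emphasize salient regions, attenuate (but do not zero out) background.
        attn = mask + self.background_weight * (1.0 - mask)
        return feat * attn

# Usage: reweight frame features before they enter the spatial stream.
layer = SaliencyMaskAttention(background_weight=0.2)
feat = torch.randn(8, 256, 14, 14)            # toy feature maps
mask = (torch.rand(8, 1, 224, 224) > 0.5)     # toy binary masks
out = layer(feat, mask)                       # (8, 256, 14, 14)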

The main contributions in this paper include the following:

  • (1)

    We build a key-frame mechanism to increase the difference between video frames and filter out redundant frames.

  • (2)

    We build a saliency action mask attention mechanism to focus on areas of interest in each frame for action recognition.

  • (3)

    We encode the spatial and temporal streams with a Bi-LSTM and a C3D network, respectively, to obtain better spatio-temporal features (a minimal sketch of the two encoders and their fusion appears after this list).
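
A minimal sketch of the two stream encoders and a weighted late fusion of their scores, assuming pre-extracted per-frame appearance features for the spatial stream and stacked optical flow for the temporal stream. The heavily reduced C3D-style block and the fusion weight alpha are illustrative, not the paper's exact configuration.

import torch
import torch.nn as nn

class SpatialBiLSTM(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, num_classes=101):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                 # x: (N, T, feat_dim) per-key-frame features
        out, _ = self.lstm(x)
        return self.fc(out[:, -1])        # class scores from the last time step

class TemporalC3D(nn.Module):
    def __init__(self, num_classes=101):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(2, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(128, num_classes)

    def forward(self, flow):              # flow: (N, 2, T, H, W) stacked optical flow
        return self.fc(self.features(flow).flatten(1))

def fuse_scores(spatial_logits, temporal_logits, alpha=0.5):
    # Weighted late fusion of the two streams' softmax scores.
    s = torch.softmax(spatial_logits, dim=1)
    t = torch.softmax(temporal_logits, dim=1)
    return alpha * s + (1.0 - alpha) * t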

The rest of this paper is organized as follows. Section 2 reviews existing work on action recognition. The proposed method is described in detail in Section 3. Section 4 presents the experimental settings and discusses the results. Finally, Section 5 concludes the paper.

Section snippets

Related work

There is a wide range of literature on action recognition in video, and we cannot cover it fully in this section. Among the various approaches, we focus on deep learning methods because of their impact [8], [23]. Below, we review existing work on action recognition from three aspects: deep learning network structures [11], [23], [24], [25], [26], [27], key-frame mechanisms, and spatial attention.

Proposed method

In this paper, we build on the two-stream network by adding a key-frame mechanism and a saliency action mask attention mechanism, which differ from other general mechanisms. We discuss three components in detail: key-frame extraction, saliency feature extraction, and feature integration and classification.

Experiments

In this section, we evaluate the proposed method on publicly available human action datasets and compare it with baseline methods. We first briefly describe the datasets and then provide the implementation details of our network. We also examine the impact of our attention mechanism and compare it with state-of-the-art approaches on real-world datasets. Finally, we discuss the results and provide some insights into our algorithm.

Conclusion

In this paper, we propose a novel network, the Spatial-Temporal Saliency Action Mask Attention Network (STSAMANet), to effectively resolve inter-frame and intra-frame redundancy in action recognition. In the feature representation phase, a key-frame mechanism is proposed to increase the difference between frames; it effectively uses two modalities, RGB and optical flow, for clustering. In addition, based on semantic segmentation, the saliency action mask attention mechanism is built to eliminate intra-frame redundancy by focusing on the salient human body and objects in each frame.

CRediT authorship contribution statement

Min Jiang: Conceptualization, Software, Writing - review & editing, Funding acquisition. Na Pan: Investigation, Data curation, Validation, Methodology, Software, Writing - original draft. Jun Kong: Supervision, Project administration, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (61362030, 61201429), China Postdoctoral Science Foundation (2015M581720, 2016M600360), Jiangsu Postdoctoral Science Foundation (1601216C), Scientific and Technological Aid Program of Xinjiang (2017E0279).

References (61)

  • K. Simonyan, A. Zisserman, Two-stream convolutional networks for action recognition in videos, in: Advances in Neural...
  • Y. Zhu, Z. Lan, S. Newsam, A. Hauptmann, Hidden two-stream convolutional networks for action recognition, in: Asian...
  • L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, L. Van Gool, Temporal segment networks: Towards good practices...
  • J. Carreira et al., Quo vadis, action recognition? A new model and the kinetics dataset.
  • D. Tran et al., A closer look at spatiotemporal convolutions for action recognition.
  • X. Wang et al., Two-stream 3-D convnet fusion for action recognition in videos with arbitrary size and length, IEEE Trans. Multimedia (2017).
  • C. Feichtenhofer, A. Pinz, R. Wildes, Spatiotemporal residual networks for video action recognition, in: Advances in...
  • C. Feichtenhofer, A. Pinz, A. Zisserman, Convolutional two-stream network fusion for video action recognition, in:...
  • L. Sevilla-Lara, Y. Liao, F. Güney, V. Jampani, A. Geiger, M.J. Black, On the integration of optical flow and action...
  • W. Zhu et al., A key volume mining deep framework for action recognition.
  • A. Kar et al., AdaScan: Adaptive scan pooling in deep convolutional neural networks for human action recognition in videos.
  • B. Fernando et al., Rank pooling for action recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2017).
  • K. Sikka et al., Discriminatively trained latent ordinal model for video classification, IEEE Trans. Pattern Anal. Mach. Intell. (2018).
  • K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask R-CNN, in: Proceedings of the IEEE International Conference on...
  • A. Ullah et al., Action recognition in video sequences using deep bi-directional LSTM with CNN features, IEEE Access (2018).
  • D. Tran et al., Learning spatiotemporal features with 3D convolutional networks.
  • J. Donahue et al., Long-term recurrent convolutional networks for visual recognition and description.
  • Y. Cai, W. Lin, J. See, M. Cheng, G. Liu, H. Xiong, Multi-scale spatiotemporal information fusion network for video...
  • W. Lin, C. Zhang, K. Lu, B. Sheng, J. Wu, B. Ni, X. Liu, H. Xiong, Action recognition with coarse-to-fine deep feature...
  • L. Fan et al., End-to-end learning of motion representation for video understanding.


    This paper has been recommended for acceptance by Zicheng Liu.
