Spatial-temporal saliency action mask attention network for action recognition☆
Introduction
Action recognition, as an important branch of computer vision, has attracted a lot of attention, both in theory and in practice. Owing to the diversity of the environment and the complexity of human action, further research on human action recognition is still needed, and an effective video action representation is key to this challenging problem. Currently, action recognition is based primarily on RGB, optical flow, skeleton, or depth images. Information provided by RGB images is sensitive to illumination [1], [2], [3]. Skeleton data are easily affected by camera angle and body occlusion [4], [5]. Depth data are vulnerable to discontinuous regions, especially around important body parts [6], [7]. Optical flow can compensate for camera movement and highlight the contours of the moving human; it is chosen as the motion modality of the network for its effectiveness in video recognition [8], [9]. However, for long-term actions or fast movement, the resulting optical flow is of poor quality. Researchers realized that single-modality data are not sufficient for action recognition, so they began to study feature fusion, such as RGB-D [6]. With the advent of deep learning, researchers turned to different network architectures, among which the two-stream network dominates [8], [9], [10], [11], [12], [13]. Accordingly, this paper investigates human action recognition based on the two-stream network.
The two-stream architecture introduces complementary information by training separate convolutional networks on RGB frames and stacks of optical flow. However, it faces the following common problems:
- (1)
A fixed-size representation is required when aggregating frame-level features into video-level features, yet most networks extract frames at random. The sampled stacks may contain a large amount of redundancy and thus lack discriminative power. Extracting frames in a better way is therefore a major problem.
- (2)
The network needs to learn features from each frame. When video-level features are fed directly into the network, most networks process them inefficiently, forcing the network to consume a large amount of useless information. Capturing salient cues in each frame is clearly another major problem.
Because these two issues limit the performance of action recognition, in this paper we present a Spatial-Temporal Saliency Action Mask Attention (STSAMA) architecture for human action recognition. The first major issue is video representation. In most networks, video frames are selected directly or at random, which easily leads to high computational cost or feature redundancy [14], [15], [16]. We therefore introduce a key-frame mechanism instead of random selection. The mechanism combines static RGB frames with optical flow frames, applies a clustering algorithm, and generates optimal cluster centers; these steps divide the video frames into several clusters and pick out the most discriminative frames. The resulting key frames remain in chronological order, preserving the motion sequence of the video. The mechanism also increases inter-frame variability and filters out a large number of similar frames.
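This excerpt does not specify which clustering algorithm is used, so the following is only a minimal numpy sketch of clustering-based key-frame selection, assuming k-means over per-frame descriptors (e.g. concatenated RGB and optical-flow features); the function name and parameters are illustrative, not the authors' implementation:

```python
import numpy as np

def extract_key_frames(features, k, n_iter=20, seed=0):
    """Cluster per-frame feature vectors with k-means, then return the
    index of the frame closest to each cluster centre, sorted so the
    selected key frames stay in chronological order."""
    feats = np.asarray(features, dtype=float)
    n = feats.shape[0]
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(n, size=k, replace=False)].copy()
    for _ in range(n_iter):
        # assign every frame to its nearest centre (Euclidean distance)
        dist = np.linalg.norm(feats[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # move each centre to the mean of its members
        for j in range(k):
            members = feats[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    # final assignment: pick the most representative frame per cluster
    dist = np.linalg.norm(feats[:, None, :] - centers[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    key_idx = []
    for j in range(k):
        idx = np.flatnonzero(labels == j)
        if len(idx):
            key_idx.append(int(idx[dist[idx, j].argmin()]))
    return sorted(key_idx)
```

Returning the indices in sorted order is what preserves the motion sequence of the video while still discarding near-duplicate frames inside each cluster.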
In addition, even when key frames are passed into our network, the recognition results for some classes remain poor because of interference from the cluttered background in each frame [17], [18], [19], [20]. Inspired by Mask R-CNN [21], we train a saliency action detection model to obtain saliency mask maps and then build an attention layer. We embed this layer into the network to highlight the effective semantic information, including specific objects and the human body, so that the network preserves inter-class differences and focuses on distinctive areas in each frame. Finally, we train two networks, with a Bidirectional LSTM (Bi-LSTM) [22] and C3D [23] respectively, to suit the unique characteristics of each modality. We then evaluate different fusion methods and training strategies to optimize the weights of our deep learning model. Extensive experiments on the UCF101 and Penn Action datasets show the validity and practicability of our method.
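The exact form of the attention layer is not given in this excerpt; a common way to apply a saliency mask to a convolutional feature map is an element-wise re-weighting with an optional residual term, sketched below under that assumption (all names are illustrative):

```python
import numpy as np

def mask_attention(feature_map, saliency_mask, residual=True):
    """Re-weight a (C, H, W) feature map by a (H, W) saliency mask with
    values in [0, 1]. The residual term adds the attended response back
    onto the original features, keeping some background context instead
    of suppressing it completely."""
    attended = feature_map * saliency_mask[None, :, :]  # broadcast over channels
    return feature_map + attended if residual else attended
```

With the residual form, salient pixels are amplified (up to doubled) while non-salient pixels pass through unchanged, which is why masked attention can emphasize the actor without discarding scene cues entirely.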
The main contributions in this paper include the following:
- (1)
We build a key-frame mechanism to increase the difference between video frames and filter out redundant frames.
- (2)
We build a saliency action mask attention mechanism to focus on areas of interest in each frame for action recognition.
- (3)
We encode the spatial and temporal streams with a Bi-LSTM and a C3D network, respectively, to obtain better spatio-temporal features.
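Since the paper evaluates several fusion methods and this excerpt does not state the final choice, the sketch below assumes a simple weighted late fusion of the two streams' class scores; the weight and function names are hypothetical:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_streams(spatial_logits, temporal_logits, w_spatial=0.4):
    """Weighted average of the per-class softmax scores from the
    spatial (Bi-LSTM) stream and the temporal (C3D) stream;
    w_spatial is the weight given to the appearance stream."""
    p_spatial = softmax(np.asarray(spatial_logits, dtype=float))
    p_temporal = softmax(np.asarray(temporal_logits, dtype=float))
    return w_spatial * p_spatial + (1.0 - w_spatial) * p_temporal
```

The predicted action is then the argmax of the fused distribution; giving the temporal stream a larger weight is a common choice in two-stream models because motion is usually the more discriminative cue.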
The rest of this paper is organized as follows. In Section 2, we review existing work on action recognition. The proposed method is described in detail in Section 3. In Section 4, we explain the experimental settings and discuss the results. Finally, we conclude the paper in Section 5.
Related work
There is a wide range of literature on action recognition in video, and we cannot cover it fully in this section. Among the various methods, we emphasize those based on deep learning [8], [23]. We review existing work on action recognition from three aspects: deep learning network structures [11], [23], [24], [25], [26], [27], key-frame mechanisms, and spatial attention.
Proposed method
In this paper, we build on the two-stream network by adding the key-frame mechanism and the saliency action mask attention mechanism, which differ from other general mechanisms. We discuss three components in detail: key-frame extraction, saliency feature extraction, and feature integration and classification.
Experiments
In this section, we evaluate the proposed method on publicly available human action datasets and compare it with baseline methods. We first briefly describe the datasets, then provide the implementation details of our network. We also explore the impact of our attention mechanism and compare it with state-of-the-art approaches on real-world datasets. Finally, we discuss the results and provide some insights into our algorithm.
Conclusion
In this paper, we propose a novel network called Spatial-Temporal Saliency Action Mask Attention Network (STSAMANet) to effectively resolve inter-frame redundancy and intra-frame redundancy for action recognition. In the feature representation phase, the key-frame mechanism is proposed to increase the difference between frames, which effectively uses two modality data for clustering, including RGB and optical flow data. Also, based on semantic segmentation, the saliency action mask attention
CRediT authorship contribution statement
Min Jiang: Conceptualization, Software, Writing - review & editing, Funding acquisition. Na Pan: Investigation, Data curation, Validation, Methodology, Software, Writing - original draft. Jun Kong: Supervision, Project administration, Funding acquisition.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was partially supported by the National Natural Science Foundation of China (61362030, 61201429), China Postdoctoral Science Foundation (2015M581720, 2016M600360), Jiangsu Postdoctoral Science Foundation (1601216C), Scientific and Technological Aid Program of Xinjiang (2017E0279).
References (61)
- et al., Joint movement similarities for robust 3D action recognition using skeletal data, J. Vis. Commun. Image Represent. (2015)
- et al., Collaborative sparse representation leaning model for RGBD action recognition, J. Vis. Commun. Image Represent. (2017)
- et al., Edited nearest neighbour for selecting keyframe summaries of egocentric videos, J. Vis. Commun. Image Represent. (2018)
- et al., VideoLSTM convolves, attends and flows for action recognition, Comput. Vis. Image Underst. (2018)
- et al., Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice, Comput. Vis. Image Underst. (2016)
- N. Ballas, L. Yao, C. Pal, A.C. Courville, Delving deeper into convolutional networks for learning video...
- et al., Contextual action recognition with R∗CNN
- et al., Beyond Gaussian pyramid: Multi-skip feature stacking for action recognition
- et al., Human action recognition by representing 3D skeletons as points in a Lie group
- et al., Discriminative relational representation learning for RGB-D action recognition, IEEE Trans. Image Process. (2016)
- Quo vadis, action recognition? A new model and the Kinetics dataset
- A closer look at spatiotemporal convolutions for action recognition
- Two-stream 3-D ConvNet fusion for action recognition in videos with arbitrary size and length, IEEE Trans. Multimedia
- A key volume mining deep framework for action recognition
- AdaScan: Adaptive scan pooling in deep convolutional neural networks for human action recognition in videos
- Rank pooling for action recognition, IEEE Trans. Pattern Anal. Mach. Intell.
- Discriminatively trained latent ordinal model for video classification, IEEE Trans. Pattern Anal. Mach. Intell.
- Action recognition in video sequences using deep bi-directional LSTM with CNN features, IEEE Access
- Learning spatiotemporal features with 3D convolutional networks
- Long-term recurrent convolutional networks for visual recognition and description
- End-to-end learning of motion representation for video understanding
☆ This paper has been recommended for acceptance by Zicheng Liu.