A study on attention-based LSTM for abnormal behavior recognition with variable pooling
Introduction
Behavior recognition has gained increasing interest in many research areas. It has been widely applied in public security, including automatic identification, video surveillance, and early warning systems. The purpose of behavior recognition is to correctly analyze human behavior using the spatial and temporal characteristics extracted from the original video clips.
Human behaviors involve many uncertain factors, and the key issue is how to extract informative features precisely. Behavior recognition algorithms can be divided into traditional algorithms and deep-learning-based algorithms. Traditional behavior recognition algorithms usually extract features with hand-crafted feature patterns and use the extracted features as the input of a classifier to obtain behavior categories. The latest research trend in behavior recognition is a deep learning approach that automatically extracts features from raw visual data by using Convolutional Neural Networks (CNN). However, the parameters and weights of the convolutional kernels need to be well trained, since they determine the validity of the extracted features.
Existing behavior recognition algorithms ignore depth information and the importance of key actions, which leads to time-consuming recognition and low accuracy. In detail, spatial and temporal information [1] are the two key factors in behavior recognition. However, most behavior recognition research [2], [3], [4], [5] uses raw RGB video images, which lose spatio-temporal information in 3D space and are easily affected by illumination, scale, and occlusion. Some studies [6], [7], [8] use approximate methods to minimize the additional damage caused by data problems.
Since human skeleton information is robust and is not affected by illumination, scaling, or occlusion in the video, we propose a behavior recognition method based on human skeleton information. The method extracts the depth information of the skeleton as the input data for behavior recognition, which minimizes the additional damage caused by illumination, scale, and occlusion. Our model, which is based on spatio-temporal relationships, is then applied to perform behavior recognition. The proposed method combines spatial information from a single skeleton with temporal information from multiple skeletons, covering multiple temporal and spatial dimensions, to improve behavior recognition accuracy. At the same time, for the network to recognize skeleton behavior without being restricted by the number of people, the feature sizes that vary with the number of people must be brought to the same size during feature extraction. We dynamically set the size of the pooling window to keep the size of the output features consistent, which solves the problem of inconsistent input sizes in multi-person behavior recognition and enables the network to identify multi-person skeleton sequences flexibly.
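As an illustration of the variable pooling idea, the following is a minimal sketch that assumes a 1D feature vector and max pooling; the function name `variable_max_pool` and the window-boundary rule (the one commonly used for adaptive pooling) are assumptions for this example, not the exact formulation of the proposed model. The window boundaries are derived from the input length, so inputs of different lengths yield outputs of the same length.

```python
from typing import List


def variable_max_pool(features: List[float], out_size: int) -> List[float]:
    """Max-pool a 1D feature vector to exactly `out_size` values.

    The window boundaries depend on the input length, so the output
    length is fixed regardless of how many people contributed features.
    (Hypothetical helper; the boundary rule is an assumption.)
    """
    n = len(features)
    if n == 0 or out_size <= 0:
        raise ValueError("need a non-empty input and a positive output size")
    pooled = []
    for i in range(out_size):
        start = (i * n) // out_size               # floor(i * n / out_size)
        end = -(-((i + 1) * n) // out_size)       # ceil((i + 1) * n / out_size)
        pooled.append(max(features[start:end]))   # window is never empty
    return pooled


# Features from one person and from two people (made-up numbers).
one_person = [0.1, 0.9, 0.4, 0.3]
two_people = one_person + [0.7, 0.2, 0.8, 0.5]

print(variable_max_pool(one_person, 2))  # [0.9, 0.4]
print(variable_max_pool(two_people, 2))  # [0.9, 0.8]
```

Both calls return two values even though the inputs differ in length, which is the property the downstream layers rely on.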
The contributions of this work can be summarized as follows:
- We propose a target depth estimation algorithm based on a fixed monocular camera to extract target depth information from 2D human skeleton coordinates. The method uses 2D images and indoor spatial information to transform a 2D skeleton into a 3D skeleton through geometric methods.
- We propose a skeleton behavior recognition model based on spatio-temporal convolution and attention-based LSTM (ST-CNN & ATT-LSTM) to obtain spatio-temporal information and deal with long-term skeleton sequences. The proposed model combines spatial information from a single skeleton with temporal information from multiple skeletons to improve behavior recognition accuracy.
- We propose a feature compression method based on variable pooling to dynamically handle the behavior recognition of multi-person skeleton sequences.
- Experimental comparison shows that the accuracy of the algorithm is higher than that of traditional models and is at an advanced level among deep learning models. The proposed framework is also evaluated on real-world surveillance video data, and the results indicate that it is superior to existing methods.
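To make the idea of recovering depth from a fixed monocular camera concrete, here is a minimal geometric sketch. It is not the algorithm proposed in this paper: it assumes a pinhole camera with known mounting height, downward pitch, focal length, and principal point, a flat floor, and that the lowest skeleton joint touches the floor. All function and parameter names are hypothetical.

```python
import math
from typing import List, Tuple


def ground_distance(v_foot_px: float, cam_height_m: float, pitch_rad: float,
                    focal_px: float, cy_px: float) -> float:
    """Horizontal distance from the camera to a point on the floor.

    Assumes a pinhole camera pitched down by `pitch_rad`; `v_foot_px` is the
    image row of the foot, `cy_px` the principal-point row.
    """
    # Angle of the viewing ray below the horizontal.
    ray_angle = pitch_rad + math.atan2(v_foot_px - cy_px, focal_px)
    if not 0.0 < ray_angle <= math.pi / 2:
        raise ValueError("viewing ray does not intersect the floor in front of the camera")
    return cam_height_m / math.tan(ray_angle)


def lift_skeleton(joints_2d: List[Tuple[float, float]],
                  depth_m: float) -> List[Tuple[float, float, float]]:
    """Attach one estimated depth to every 2D joint (a coarse assumption)."""
    return [(u, v, depth_m) for (u, v) in joints_2d]


# Made-up 2D joints (pixel coordinates); the lowest joint is taken as the foot.
joints = [(320.0, 100.0), (318.0, 180.0), (322.0, 240.0)]
foot_row = max(v for _, v in joints)
depth = ground_distance(foot_row, cam_height_m=3.0, pitch_rad=math.pi / 4,
                        focal_px=500.0, cy_px=240.0)
skeleton_3d = lift_skeleton(joints, depth)
```

Assigning a single depth to all joints of a person is a simplification; a per-joint estimate would need additional scene or body-proportion information.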
The rest of this paper is organized as follows: Section 2 describes current research on behavior recognition. Section 3 provides details of our proposed behavior recognition model. Section 4 evaluates our proposed framework with real-world surveillance video data. Section 5 concludes the paper.
Related work
In this section, we start with a brief review of the early development of human behavior recognition models, including models that employ RGB images and videos. Then, we describe the disadvantages of these methods, which led to the adoption of behavior recognition models based on skeleton sequences.
ST-CNN & ATT-LSTM
To obtain the depth information of the human body and better handle long-term sequences, in this section we draw on the idea of the attention mechanism and propose a skeleton behavior recognition model based on spatio-temporal relationships. The model includes a target depth estimation algorithm for the skeleton and a network combining spatio-temporal convolution and attention-based LSTM (ST-CNN & ATT-LSTM). The model structure is shown in Fig. 1.
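The role of attention on top of an LSTM can be sketched as a weighted sum of per-frame hidden states, where the weights come from a softmax over relevance scores. The sketch below assumes dot-product scoring against a learned query vector; the scoring function used in the proposed model is not given in this excerpt, so `attention_pool` and its arguments are hypothetical.

```python
import math
from typing import List, Tuple


def attention_pool(hidden_states: List[List[float]],
                   query: List[float]) -> Tuple[List[float], List[float]]:
    """Return (attention weights, context vector) over a sequence.

    `hidden_states` holds one vector per time step (e.g. LSTM outputs);
    `query` is a scoring vector of the same dimension (assumed learned).
    """
    # Relevance score of each time step: dot product with the query.
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]
    # Numerically stable softmax over time steps.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the hidden states.
    dim = len(query)
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states))
               for d in range(dim)]
    return weights, context


# Three made-up hidden states; the last one aligns best with the query.
states = [[0.0, 0.0], [0.5, 0.1], [1.0, 1.0]]
weights, context = attention_pool(states, query=[1.0, 1.0])
```

Time steps with higher scores receive larger weights, so key frames of a long skeleton sequence contribute more to the final representation than uninformative ones.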
Dataset
To evaluate the effectiveness of the proposed method, we compare the proposed model with other models on the SBU Interaction dataset and the proposed LF-skeleton + 3DInfo dataset. The LF-skeleton + 3DInfo dataset consists of 2D skeleton information plus skeleton depth information: the 2D skeleton data is extracted from LF-skeleton with the OpenPose [53] algorithm, and the depth estimation method is then used to add depth information, yielding a 3D skeleton dataset.
SBU Interaction Dataset [54]: The same
Conclusion
In this paper, we proposed a behavior recognition framework based on skeleton spatio-temporal relationships and attention (ST-CNN & ATT-LSTM). Using the target depth estimation algorithm for a monocular camera, the framework acquires the depth information of the human body's 2D skeleton and identifies the behavior through the temporal and spatial relationships of the skeleton.
The fixed camera based depth estimation method simulates a real indoor scene. Through the 2D image and the real
CRediT authorship contribution statement
Kai Zhou: Conceptualization, Methodology. Bei Hui: Data curation, Software, Writing - original draft. Junfeng Wang: Validation, Investigation. Chunyu Wang: Supervision, Formal analysis. Tingting Wu: Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This work was supported by the National Key R&D Program of China (No. 2018YFC0807500), by the National Natural Science Foundation of China (No. U19A2059), and by the Ministry of Science and Technology of Sichuan Province Program (No. 2018GZDZX0048, 20ZDYF0343).
References (58)
- et al. Multi-stream CNN: learning representations based on human-related regions for action recognition. Pattern Recogn. (2018)
- et al. Primary social behavior aware routing and scheduling for cognitive radio networks
- Long short-term memory. Neural Comput. (1997)
- et al. Two-stream convolutional networks for action recognition in videos
- et al. Learning spatiotemporal features with 3D convolutional networks
- et al. Action recognition with improved trajectories
- et al. Trading private range counting over big IoT data
- et al. Deletion propagation for multiple key preserving conjunctive queries: approximations and complexity
- et al. Privacy-preserving auto-driving: a GAN-based approach to protect vehicular camera data
- et al. Recognizing actions by shape-motion prototype trees