A study on attention-based LSTM for abnormal behavior recognition with variable pooling

https://doi.org/10.1016/j.imavis.2021.104120

Highlights

  • Realizes the transformation from 2D skeleton to 3D skeleton through geometric methods.

  • Combines spatial and temporal information from multiple skeletons.

  • Dynamically handles the behavior recognition of multi-person skeleton sequences.

  • The proposed framework is evaluated with real-world surveillance video data.

Abstract

Behavior recognition is a well-known computer vision technology. It has been used in many applications, such as video surveillance, on-device motion detection, human-computer interaction, and sports video analysis. However, most existing works ignore depth and spatio-temporal information, which leads to over-fitting and inferior performance. Consequently, a novel framework for behavior recognition is proposed in this paper. In this framework, we propose a target depth estimation algorithm to calculate the 3D spatial position of the target and take this information as the input of the behavior recognition model. Simultaneously, to obtain more spatio-temporal information and better handle long videos, we combine the idea of the attention mechanism with a skeleton behavior recognition model based on spatio-temporal convolution and attention-based LSTM (ST-CNN & ATT-LSTM). The depth information is merged into each segment, and the model focuses on extracting the key information, which is essential for improving behavior recognition performance. Meanwhile, we use a feature compression method based on variable pooling to solve the problem of inconsistent input sizes caused by multi-person behavior recognition, so that the network can flexibly recognize multi-person skeleton sequences. Finally, the proposed framework is evaluated on real-world surveillance video data, and the results indicate that it is superior to existing methods.

Introduction

Behavior recognition has gained increasing interest in many research areas. It has been widely applied in public security, including automatic identification, video surveillance, and early warning systems. The purpose of behavior recognition is to correctly analyze human behavior from the spatial and temporal characteristics extracted from raw video clips.

Human behaviors involve many uncertain factors, and the key issue is how to extract informative features precisely. Behavior recognition algorithms can be divided into traditional algorithms and deep-learning-based algorithms. Traditional algorithms usually extract features with hand-crafted feature patterns and feed the extracted features to a classifier to obtain behavior categories. The latest research trend in behavior recognition is the deep learning approach, which automatically extracts features from raw visual data using Convolutional Neural Networks (CNNs). However, the parameters and weights of the convolutional kernels need to be well trained, as they determine the validity of the extracted features.

Existing behavior recognition algorithms ignore depth information and the importance of key actions, leading to long running times and low accuracy. In detail, spatial and temporal information [1] are the two key factors in behavior recognition. However, most behavior recognition research [2], [3], [4], [5] uses raw RGB video frames, which lose spatio-temporal information in 3D space and are easily affected by illumination, scale, and occlusion. In some studies [6], [7], [8], approximate methods are used to minimize the additional damage caused by these data problems.

Since human skeleton information is robust to illumination, scaling, and occlusion in video, we propose a behavior recognition method based on human skeleton information. The method extracts the depth information of the skeleton as the input data for behavior recognition, minimizing the additional damage caused by illumination, scale, and occlusion. A model based on spatio-temporal relationships is then applied to perform behavior recognition. The proposed method combines spatial information from a single skeleton with temporal information from multiple skeletons, composing multiple temporal and spatial dimensions to improve recognition accuracy. At the same time, for the network to recognize skeleton behavior without being restricted by the number of people, the differently sized features produced by different numbers of people must be mapped to a common size during feature extraction. We dynamically set the size of the pooling window to keep the size of the output features consistent, which solves the problem of inconsistent input sizes caused by multi-person behavior recognition and enables the network to recognize multi-person skeleton sequences flexibly.
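The preview does not give the exact pooling formulation, so the following PyTorch snippet is only a minimal sketch of this variable-pooling idea under our own assumptions; the function name, the 15-joints-per-person layout, and the 4 × 4 output size are illustrative, not taken from the paper.

```python
# Minimal sketch of variable pooling (our PyTorch rendition, not the
# authors' code): the pooling window is derived from the input size so
# that the output size is fixed no matter how many people are present.
import torch
import torch.nn.functional as F

def variable_pool(features: torch.Tensor, out_hw=(4, 4)) -> torch.Tensor:
    """Compress a (N, C, H, W) feature map to a fixed (N, C, *out_hw).

    H grows with the number of skeletons in the frame, so the window
    (roughly H/out_h x W/out_w) is recomputed per input instead of being
    fixed when the network is built; adaptive pooling does exactly that.
    """
    return F.adaptive_max_pool2d(features, out_hw)

# Feature maps for 2 and for 5 people both compress to the same shape,
# so the downstream layers always see a constant input size.
for people in (2, 5):
    x = torch.randn(1, 64, people * 15, 32)  # assume 15 joints per person
    print(variable_pool(x).shape)            # torch.Size([1, 64, 4, 4])
```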

The contributions of this work can be summarized as follows:

  • We propose a target depth estimation algorithm based on a fixed monocular camera to extract target depth information from 2D human skeleton coordinates. The method uses 2D images and indoor spatial information to transform the 2D skeleton into a 3D skeleton through geometric methods (a sketch of this geometry follows this list).

  • A skeleton behavior recognition model that is based on spatio-temporal convolution and attention-based LSTM (ST-CNN & ATT-LSTM) is proposed to obtain spatio-temporal information and deal with long-term skeleton sequences. The proposed model combines spatial information from a single skeleton with temporal information from multiple skeletons to improve behavior recognition accuracy.

  • We propose a feature compression method based on variable pooling to dynamically handle the behavior recognition of multi-person skeleton sequences.

  • Experimental comparisons show that the accuracy of our algorithm is higher than that of traditional models and is competitive with state-of-the-art deep learning models. The proposed framework is evaluated with real-world surveillance video data, and the results indicate that it is superior to existing methods.
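As a concrete reading of the first contribution, the sketch below shows one standard way such a fixed-camera geometric lift can work. The paper's exact construction is not given in this preview, so everything here is an assumption: a level pinhole camera at a known height, a flat floor, an ankle keypoint in ground contact, and a shared per-person depth.

```python
# Hedged sketch of 2D-to-3D skeleton lifting with a fixed, calibrated
# monocular camera (illustrative, not the authors' algorithm).
import numpy as np

def ankle_depth(v_ankle: float, f_px: float, cy: float, cam_h_m: float) -> float:
    """Depth Z (m) of a ground-contact point seen at image row v_ankle.

    For a level pinhole camera at height H over a flat floor,
    v - cy = f * H / Z, hence Z = f * H / (v - cy); requires v > cy
    (the floor projects below the principal point).
    """
    return f_px * cam_h_m / (v_ankle - cy)

def lift_skeleton(joints_uv: np.ndarray, z: float, f_px: float,
                  cx: float, cy: float) -> np.ndarray:
    """Back-project 2D joints (J, 2) to 3D (J, 3), assuming every joint of
    one person shares the ankle's depth (a flat-body approximation)."""
    u, v = joints_uv[:, 0], joints_uv[:, 1]
    x = (u - cx) * z / f_px
    y = (v - cy) * z / f_px
    return np.stack([x, y, np.full_like(u, z)], axis=1)

# Assumed intrinsics: f = 1000 px, 1920x1080 image, camera mounted 2.4 m up.
z = ankle_depth(v_ankle=900.0, f_px=1000.0, cy=540.0, cam_h_m=2.4)
print(round(z, 2))  # about 6.67 m to the person
```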

The rest of this paper is organized as follows: Section 2 reviews current research on behavior recognition. Section 3 provides the details of our proposed behavior recognition model. Section 4 evaluates the proposed framework with real-world surveillance video data. Section 5 concludes this paper.

Section snippets

Related work

In this section, we start by providing a brief review of the early development of human behavior recognition models, including models that employ RGB images and videos. We then describe the disadvantages of these methods, which motivated the adoption of behavior recognition models based on skeleton sequences.

ST-CNN & ATT-LSTM

To obtain the depth information of the human body and better handle long-term sequences, in this section we combine the idea of the attention mechanism with a skeleton behavior recognition model based on spatio-temporal relationships. The model includes a target depth estimation algorithm for the skeleton and a network combining spatio-temporal convolution and attention-based LSTM (ST-CNN & ATT-LSTM). The model structure is shown in Fig. 1.
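Since only this overview of Fig. 1 is available in the preview, the following PyTorch sketch shows how we read the ST-CNN & ATT-LSTM pipeline: a spatio-temporal convolution encodes each segment of the skeleton sequence, an LSTM models the segment order, and additive attention re-weights the hidden states before classification. All layer sizes, the segment layout, and the class count are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of an ST-CNN & ATT-LSTM-style model (our reading of the
# preview; hyperparameters and layer choices are assumptions).
import torch
import torch.nn as nn

class STCNNAttLSTM(nn.Module):
    def __init__(self, in_ch=3, feat=128, hidden=256, n_classes=8):
        super().__init__()
        # Spatio-temporal conv over (channels=xyz, frames, joints) per segment.
        self.st_cnn = nn.Sequential(
            nn.Conv2d(in_ch, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # one feat-dim vector per segment
        )
        self.lstm = nn.LSTM(feat, hidden, batch_first=True)
        self.att = nn.Linear(hidden, 1)        # additive attention score
        self.cls = nn.Linear(hidden, n_classes)

    def forward(self, segments):               # (B, S, C, T, J)
        b, s = segments.shape[:2]
        x = segments.flatten(0, 1)             # (B*S, C, T, J)
        x = self.st_cnn(x).flatten(1)          # (B*S, feat)
        h, _ = self.lstm(x.view(b, s, -1))     # (B, S, hidden)
        w = torch.softmax(self.att(h), dim=1)  # (B, S, 1): segment weights
        return self.cls((w * h).sum(dim=1))    # attention-pooled logits

# 2 clips, 6 segments each, xyz coordinates, 16 frames, 15 joints.
logits = STCNNAttLSTM()(torch.randn(2, 6, 3, 16, 15))
print(logits.shape)  # torch.Size([2, 8])
```

The attention weights let the classifier emphasize the segments that carry the key actions, which is the role the abstract assigns to the attention mechanism.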

Dataset

To evaluate the effectiveness of the proposed method, we compare the proposed model with other models on the SBU Interaction dataset and the proposed LF-skeleton + 3DInfo dataset. The LF-skeleton + 3DInfo dataset combines 2D skeleton information with skeleton depth information: the 2D skeleton data are extracted from LF-skeleton with the OpenPose [53] algorithm, and the depth estimation method is applied to obtain a 3D skeleton dataset carrying depth information.

SBU Interaction Dataset [54]: The same

Conclusion

In this paper, we proposed a framework based on skeleton spatio-temporal relationships and attention-based behavior recognition (ST-CNN & ATT-LSTM). Based on the target depth estimation algorithm for a monocular camera, the framework acquires the depth information of the human body's 2D skeleton and recognizes behavior through the temporal and spatial relationships of the skeleton.

The fixed-camera based depth estimation method simulates a real indoor scene. Through the 2D image and the real

CRediT authorship contribution statement

Kai Zhou: Conceptualization, Methodology. Bei Hui: Data curation, Software, Writing - original draft. Junfeng Wang: Validation, Investigation. Chunyu Wang: Supervision, Formal analysis. Tingting Wu: Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported by the National Key R&D Program of China (No. 2018YFC0807500), by National Natural Science Foundation of China (No. U19A2059), and by Ministry of Science and Technology of Sichuan Province Program (No. 2018GZDZX0048, 20ZDYF0343).

References (58)

  • Z. Tu et al., Multi-stream CNN: learning representations based on human-related regions for action recognition, Pattern Recogn. (2018)
  • S. Ji et al., Primary social behavior aware routing and scheduling for cognitive radio networks
  • A. Graves, Long short-term memory, Neural Comput. (1997)
  • K. Simonyan et al., Two-stream convolutional networks for action recognition in videos
  • D. Tran et al., Learning spatiotemporal features with 3D convolutional networks
  • H. Wang et al., Action recognition with improved trajectories
  • Z. Cai et al., Trading private range counting over big IoT data
  • Z. Cai et al., Deletion propagation for multiple key preserving conjunctive queries: approximations and complexity
  • Z. Xiong et al., Privacy-preserving auto-driving: a GAN-based approach to protect vehicular camera data
  • Z. Lin et al., Recognizing actions by shape-motion prototype trees
  • A.A. Efros et al., Recognizing action at a distance
  • Dense trajectories and motion boundary descriptors for action recognition, Int. J. Comput. Vis. (2013)
  • F. Perronnin, Fisher kernels on visual vocabularies for image categorization (2007)
  • Y.H. Ng et al., Beyond short snippets: deep networks for video classification
  • Z. Shou et al., CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos (2017)
  • W. Du et al., RPAN: an end-to-end recurrent pose-attention network for action recognition in videos
  • A. Vaswani et al., Attention is all you need, CoRR abs/1706.03762
  • K. Xu et al., Show, attend and tell: neural image caption generation with visual attention, Comp. Sci. (2015)
  • S. Sharma et al., Action recognition using visual attention
  • S. Yeung et al., Every moment counts: dense detailed labeling of actions in complex videos, Int. J. Comput. Vis. (2015)
  • K. Li et al., Seed-free graph de-anonymization with adversarial learning
  • Z. Xiong et al., ADGAN: protect your location privacy in camera data of auto-driving vehicles, IEEE Trans. Indus. Inform. (2020)
  • X. Zheng et al., Privacy-preserved distinct content collection in human-assisted ubiquitous computing systems, Inf. Sci. (2019)
  • X. Cheng et al., Human behavior recognition based on key frame, Comp. Eng. Appl. (2011)
  • G. Johansson, Visual perception of biological motion and a model for its analysis, Percept. Psychophys. (1973)
  • X. Yang et al., EigenJoints-based action recognition using Naive-Bayes-Nearest-Neighbor
  • M.E. Hussein et al., Human action recognition using a temporal hierarchy of covariance descriptors on 3D joint locations
  • H. Xu et al., R-C3D: region convolutional 3D network for temporal activity detection
  • Y. Hu et al., Spatial-temporal fusion convolutional neural network for simulated driving behavior recognition (2018)