Scene image and human skeleton-based dual-stream human action recognition☆
Introduction
In recent years, human action recognition has received widespread attention in the field of computer vision because of its vast application prospects in video surveillance, medical rehabilitation, human–computer interaction, etc. Many scholars are working on human action recognition algorithms. Dense Trajectories (DT) is a traditional human action recognition method [1] that obtains a series of trajectories of the object in the video from the optical flow. It then manually extracts HOG, HOF, and MBH features along the trajectories, encodes them with the Fisher Vector (FV) method, and feeds the codes to a support vector machine (SVM) for action classification. The iDT algorithm improves on DT [2]: SURF keypoints are used to match the optical flow between video frames, thereby weakening or even eliminating the effect of camera shake. Depth image-based action recognition is another classical approach [3], in which features such as HOG can be stored for long-term memory [4]. Traditional action recognition algorithms require time-consuming and labor-intensive manual extraction of various features. Moreover, recognition is divided into several separate stages, which is complicated and cannot be trained in an end-to-end manner.
With the development of deep learning, many action recognition methods have emerged, including the dual-stream CNN, the three-dimensional convolution method (C3D), and human skeleton-based methods [5]. The dual-stream method, first proposed by Simonyan in 2014, is a major deep learning-based human action recognition approach [6]. It divides the model into spatial and temporal streams and typically relies on optical flow [7]. A single frame of the video is selected as the input of the spatial stream, and the dense optical flow between every two adjacent frames is computed as the input of the temporal stream. The image information provides spatial features, while the optical flow provides temporal features. The image and optical flow convolutional models are trained separately, and the classifications of the two branches are fused by an SVM or by averaging. Combining the spatial and temporal features of the video benefits human action classification [8]. The advent of the dual-stream method was a major step for deep learning-based human action recognition, and many improvements based on it have been proposed. In 2015, Ng et al. improved it by adopting long short-term memory (LSTM) for the fusion of the temporal stream [9]. In 2016, Feichtenhofer et al. replaced the basic spatial and temporal stream networks with VGG-16, improved the fusion position and method of the original dual-stream model, and achieved better results [10]. Wang et al. proposed Temporal Segment Networks (TSN), which builds on the dual-stream method by introducing a sparse temporal sampling algorithm to cover a longer span of video frames, addressing the poor recognition of long-timespan actions [11]. In 2017, Lan et al. improved the fusion part of TSN by assigning different weights to different segments through autonomous learning [12]. Zhou et al. proposed the Temporal Relation Network (TRN), which, building on TSN, can learn and infer the temporal dependence between video frames at multiple time scales, thereby achieving better results [13].
C3D is another deep learning-based human action recognition method; drawing on the two-dimensional convolution model, it introduces a three-dimensional convolution kernel for video processing. As early as 2013, Ji et al. proposed a three-dimensional convolution model for human action recognition that extracts both spatial and temporal features [14]. In 2015, Tran et al. proposed the 3-Dimensional Convolution (C3D) method. C3D targets large-scale supervised video datasets and suits the study of spatio-temporal features better than two-dimensional convolutional neural networks; 3 × 3 × 3 convolution kernels are suggested for each layer of a C3D [15]. Although the classification accuracy of C3D is generally lower than that of the dual-stream method, it offers better real-time performance with an end-to-end training mode, and its network structure is more concise. It has therefore become a research hotspot, and improved algorithms such as I3D [16], T3D [17], and P3D [18] outperform the original C3D. In summary, the optical flow-based dual-stream model achieves high recognition accuracy, but its real-time performance is poor and cannot meet practical application requirements; it is also less robust to lighting changes. The 3D convolution method exhibits better real-time performance, but its accuracy is consistently lower than that of the dual-stream model.
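The late fusion step of the dual-stream approach can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the two streams are stand-ins for a spatial CNN (single RGB frame) and a temporal CNN (stacked optical flow), and the weighted-average fusion rule and toy scores are assumptions for the example.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over class scores."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def fuse_two_stream(spatial_logits, temporal_logits, w_spatial=0.5):
    """Weighted average fusion of the two streams' class probabilities.

    `spatial_logits` would come from a CNN over a single RGB frame and
    `temporal_logits` from a CNN over stacked optical-flow fields; here
    they are plain arrays so that the fusion step stands alone.
    """
    fused = (w_spatial * softmax(spatial_logits)
             + (1.0 - w_spatial) * softmax(temporal_logits))
    return int(np.argmax(fused))

# Toy scores for 5 action classes: the streams disagree, fusion decides.
spatial = np.array([2.0, 0.5, 0.1, 0.0, 0.0])   # spatial stream favors class 0
temporal = np.array([0.1, 0.2, 3.0, 0.0, 0.0])  # temporal stream favors class 2
print(fuse_two_stream(spatial, temporal))        # 2 (temporal stream is more confident)
```

Averaging after the softmax, rather than averaging raw logits, keeps the two streams on a comparable scale; TSN-style methods additionally learn per-segment weights instead of the fixed `w_spatial` used here.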
The human skeleton-based action recognition method takes the human skeleton as input. Its advantage is that skeleton data directs more attention to the movement patterns of the human torso and limbs. Spatio-temporal graph convolution over the human skeleton is one such method. Skeleton information greatly reduces the size of the input data; moreover, skeleton data expresses human movement well, eliminating the interference of background information and the impact of camera shake. In 2016, Song et al. proposed an end-to-end spatio-temporal attention model based on skeleton information. Built on a recurrent neural network with LSTM, the model learns to selectively attend to different joints in each frame of the skeleton [19]. Zhu et al. proposed a skeleton-based deep LSTM network that achieves human action recognition by mining joint co-occurrences [20]. Li et al. proposed a multi-task end-to-end joint classification-regression recurrent neural network, which adopts joint classification and regression as the energy function to locate the start and end points of actions automatically and more accurately [21]. In 2018, Yan et al. from the CUHK-SenseTime Joint Laboratory proposed Spatial Temporal Graph Convolutional Networks (ST-GCN), which apply graph convolution to the human skeleton [22]. Since the skeleton graph has a natural topological structure, and graph convolutional networks have achieved good results on such data in recent years, this represents a brand-new idea that performs well for human action recognition. However, using skeleton information alone discards the scene information, making it difficult to distinguish actions with similar movement characteristics.
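The core operation of such skeleton graph convolution methods can be illustrated with a toy example. The sketch below uses an assumed 5-joint chain skeleton rather than the 18-joint OpenPose graph of ST-GCN, and random weights in place of learned ones; only the aggregate-then-transform pattern is the point.

```python
import numpy as np

# One spatial graph convolution step on a toy 5-joint skeleton.
V, C_in, C_out = 5, 3, 8          # joints, input channels (x, y, score), output channels

# Adjacency of a simple chain 0-1-2-3-4 with self-loops (A + I).
A = np.eye(V)
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1

# Symmetric normalization D^{-1/2} (A + I) D^{-1/2}, standard in graph convolution.
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
A_norm = D_inv_sqrt @ A @ D_inv_sqrt

rng = np.random.default_rng(0)
X = rng.standard_normal((V, C_in))      # per-joint features for one frame
W = rng.standard_normal((C_in, C_out))  # weights (learned in practice, random here)

H = A_norm @ X @ W                      # aggregate over neighboring joints, then transform
print(H.shape)                          # (5, 8)
```

ST-GCN itself partitions the neighborhood into root/centripetal/centrifugal subsets with separate weights and adds temporal convolution across frames; this single-matrix form is the simplest special case.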
This paper proposes a human skeleton and scene image-based dual-stream model for action recognition. The motion characteristics of the human limbs are extracted by a spatio-temporal graph convolutional model, and the scene information is extracted by an image convolutional model. To extract scene information better, sparse sampling and video-level supervision strategies are introduced to process the video. The scene and motion information are fused complementarily, which improves the performance of the skeleton-based action recognition model.
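The sparse sampling strategy mentioned above follows the TSN idea of covering the whole video with few frames. A minimal sketch, assuming frames are addressed by index and the video has at least as many frames as segments (function name and defaults are illustrative, not from the paper):

```python
import random

def sparse_sample(num_frames, num_segments=3, train=True):
    """TSN-style sparse sampling: split the frame index range into
    `num_segments` equal segments and pick one frame from each
    (random offset during training, segment center at test time).

    Assumes num_frames >= num_segments; leftover frames at the end
    of the video are simply not sampled in this simplified version.
    """
    seg_len = num_frames // num_segments
    indices = []
    for s in range(num_segments):
        start = s * seg_len
        offset = random.randrange(seg_len) if train else seg_len // 2
        indices.append(start + offset)
    return indices

print(sparse_sample(90, num_segments=3, train=False))  # [15, 45, 75]
```

Because each sampled frame comes from a different segment, the video-level prediction aggregated over the segments supervises the whole clip rather than a single short snippet.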
The remainder of this paper is organized as follows: Section 2 introduces human skeleton gathering and processing. In Section 3, the human skeleton and scene image-based action recognition model is described. Section 4 provides the simulation experimental settings and results. The concluding remarks are given in Section 5.
Human skeleton gathering
The skeleton gathering precedes skeleton-based human action recognition. Skeleton data can be gathered with motion capture devices [23] or extracted from video using pose estimation algorithms. In this paper, the pose estimation algorithm OpenPose is used to extract the human skeleton from the video. OpenPose is an algorithm that detects human body keypoints (neck, shoulders, elbows, etc.), connects them into bones, and then estimates the pose of the human body [24]. As a video preprocessing tool, you
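A typical preprocessing step is packing the per-frame keypoints into a fixed-layout array for the downstream skeleton model. The sketch below assumes OpenPose's flat `pose_keypoints_2d` layout of (x, y, confidence) triplets per joint; the function name and the (C, T, V) ordering are illustrative conventions, not prescribed by the paper.

```python
import numpy as np

def keypoints_to_tensor(frames_keypoints, num_joints=18):
    """Pack per-frame OpenPose keypoints into a (C, T, V) array:
    C = (x, y, confidence) channels, T = frames, V = joints.

    `frames_keypoints` holds one flat [x0, y0, c0, x1, y1, c1, ...]
    list per frame, as in OpenPose's `pose_keypoints_2d` JSON field
    (single-person case; multi-person handling is omitted here).
    """
    T = len(frames_keypoints)
    data = np.zeros((3, T, num_joints))
    for t, flat in enumerate(frames_keypoints):
        joints = np.asarray(flat, dtype=float).reshape(num_joints, 3)
        data[:, t, :] = joints.T
    return data

# Two frames of a toy 2-joint "skeleton".
frames = [[10, 20, 0.9, 30, 40, 0.8],
          [11, 21, 0.9, 31, 41, 0.7]]
print(keypoints_to_tensor(frames, num_joints=2).shape)  # (3, 2, 2)
```

Undetected joints, which OpenPose reports with zero confidence, can then be masked or interpolated on this array before it is fed to the graph convolutional model.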
Human skeleton and scene image-based dual-stream model
The dual-stream structure was inspired by the finding that the human visual cortex contains two pathways: the ventral stream, responsible for object recognition, and the dorsal stream, responsible for motion recognition [6]. Based on this, researchers proposed the dual-stream model for human action recognition in video, whose structure is shown in Fig. 5.
The two branches of the dual-stream model are the spatial and temporal networks. The optical flow method can identify the motion
Dataset
The UCF-101 [34] and HMDB51 datasets are used for verification. UCF-101 is a classic action recognition dataset derived from YouTube, covering 101 action classes. The clips of each class are cut from 25 groups of longer videos, with four to seven clips per group, for a total of 13,320 videos at a resolution of 320 × 240. The videos in UCF-101 are diverse, including camera motion, appearance changes, posture changes, object scale changes, background changes, fiber
Conclusion
Optical flow-based action recognition is ineffective when the scene brightness is unstable. Skeleton-based human action recognition methods possess unique advantages, directing more attention to the human motion characteristics; however, the scene information is lost. Therefore, combining the respective advantages of the skeleton and the dual-stream approach, a scene image and human skeleton-based dual-stream model is constructed to achieve human action recognition. The skeleton
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
We thank Mr. Yan for making the open-source code of ST-GCN available at https://github.com/yysijie/st-gcn and the developers of OpenPose, whose source code is accessible at https://github.com/CMU-Perceptual-Computing-Lab/openpose. This work was supported by the National Key Research and Development Plan of China under Grants 2020AAA0108903 and 2017YFB1300205, and the National Natural Science Foundation of China under Grants 61803227, 61573213, 61603214, and 61673245.
References (58)
- et al., A template matching approach of one-shot-learning gesture recognition, Pattern Recogn. Lett. (2013)
- et al., A review of Convolutional-Neural-Network-based action recognition, Pattern Recogn. Lett. (2019)
- et al., Spatio-temporal silhouette sequence reconstruction for gait recognition against occlusion, IPSJ Trans. Comput. Vision Appl. (2019)
- et al., First Person Action Recognition via Two-stream ConvNet with Long-term Fusion Pooling, Pattern Recogn. Lett. (2018)
- H. Wang, A. Kläser, C. Schmid, C. Liu, Action recognition by dense trajectories, 2011 IEEE Conference on Computer...
- et al., Action Recognition with Improved Trajectories
- et al., Action recognition based on binary patterns of action-history and histogram of oriented gradient, J. Multimodal User In. (2016)
- et al., Two-Stream Convolutional Networks for Action Recognition in Videos
- et al., Action recognition based on statistical analysis from clustered flow vectors, Signal Image Video Process. (2014)
- et al., Beyond short snippets: Deep networks for video classification
- Convolutional Two-Stream Network Fusion for Video Action Recognition
- Deep Local Video Feature for Action Recognition
- Temporal Relational Reasoning in Videos
- 3D Convolutional Neural Networks for Human Action Recognition, IEEE Trans. Pattern Anal. Mach. Intell.
- Learning Spatiotemporal Features with 3D Convolutional Networks
- Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
- Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification, arXiv
- Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks
- An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data
- Co-Occurrence Feature Learning for Skeleton Based Action Recognition Using Regularized Deep LSTM Networks
- Online Human Action Detection Using Joint Classification-Regression Recurrent Neural Networks
- Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition
- A method for sensor-based activity recognition in missing data scenario, Sensors (Switzerland)
- Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields
- An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling, arXiv
- On the Integration of Optical Flow and Action Recognition, CoRR
- Optical flow estimation method under the condition of illumination change, J. Image Graph.
- Temporal Localization of Actions with Actoms, IEEE Trans. Pattern Anal. Mach. Intell.
☆ Editor: Qingyang Xu.