Scene image and human skeleton-based dual-stream human action recognition☆
Introduction
In recent years, human action recognition has received widespread attention in the field of computer vision because of its vast application prospects in video surveillance, medical rehabilitation, human–computer interaction, etc. Many scholars are working on human action recognition algorithms. Dense Trajectories (DT) is a traditional human action recognition method [1] that obtains a series of trajectories of the object in the video from the optical flow. It then manually extracts HOG, HOF, and MBH features along the trajectories, encodes them with the Fisher Vector (FV) method, and feeds the codes to a support vector machine (SVM) for action classification. The iDT algorithm improves on DT [2]: SURF keypoints are used to match the optical flow between video frames, thereby weakening or even eliminating the effect of camera shake. Depth image-based action recognition is another classical approach [3], in which features such as HOG can be stored for long-term memory [4]. Traditional action recognition algorithms require time-consuming and labor-intensive manual extraction of various features. Moreover, recognition is divided into several separate stages, which is complicated and cannot be trained in an end-to-end manner.
With the development of deep learning, many action recognition methods have emerged, including the dual-stream CNN, the three-dimensional convolution method (C3D), and human skeleton-based methods [5]. The dual-stream method, first proposed by Simonyan in 2014, is a major deep learning-based human action recognition approach [6]. It divides the model into spatial and temporal streams and typically relies on optical flow [7]. A single frame of the video is selected as the input of the spatial stream, and the dense optical flow between every two adjacent frames is computed as the input of the temporal stream. The image information provides spatial features, while the optical flow provides temporal features. The image and optical flow convolutional models are trained separately, and the classifications of the two branches are fused by an SVM or by averaging. Combining the spatial and temporal features of the video benefits human action classification [8]. The advent of the dual-stream method was a major step for deep learning-based human action recognition, and many improvements based on it have been proposed. In 2015, Ng et al. improved it by adopting long short-term memory (LSTM) for the fusion of the temporal stream [9]. In 2016, Feichtenhofer et al. replaced the basic spatial and temporal stream networks with VGG-16, improved the fusion position and method of the original dual-stream model, and achieved better results [10]. Wang et al. proposed Temporal Segment Networks (TSN), which builds on the dual-stream method by introducing a sparse temporal sampling algorithm to cover a longer span of video frames, addressing the poor recognition of long-timespan actions [11]. In 2017, Lan et al. improved the fusion part of TSN by assigning different weights to different segments through autonomous learning [12]. Zhou et al. proposed the Temporal Relation Network (TRN), which, building on TSN, can learn and infer the temporal dependence between video frames at multiple time scales, thereby achieving better results [13].
C3D is another deep learning-based human action recognition method; drawing on the two-dimensional convolution model, it introduces a three-dimensional convolution kernel for video processing. As early as 2013, Ji et al. proposed a three-dimensional convolution model for human action recognition that extracts both spatial and temporal features [14]. In 2015, Tran et al. proposed the 3-Dimensional Convolution (C3D) method. C3D targets large-scale supervised video datasets and suits the study of spatio-temporal features better than two-dimensional convolutional neural networks; 3 × 3 × 3 convolution kernels are suggested for each layer of a C3D [15]. Although the classification accuracy of C3D is generally lower than that of the dual-stream method, it offers better real-time performance with an end-to-end training mode, and its network structure is more concise. It has therefore become a research hotspot, and improved algorithms such as I3D [16], T3D [17], and P3D [18] outperform the original C3D. In summary, the optical flow-based dual-stream model achieves high recognition accuracy, but its real-time performance is poor and cannot meet practical application requirements; it is also less robust to lighting changes. The 3D convolution method exhibits better real-time performance, but its accuracy is consistently lower than that of the dual-stream model.
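The late fusion step of the dual-stream approach can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the two streams are stand-ins for a spatial CNN (single RGB frame) and a temporal CNN (stacked optical flow), and the weighted-average fusion rule and toy scores are assumptions for the example.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over class scores."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def fuse_two_stream(spatial_logits, temporal_logits, w_spatial=0.5):
    """Weighted average fusion of the two streams' class probabilities.

    `spatial_logits` would come from a CNN over a single RGB frame and
    `temporal_logits` from a CNN over stacked optical-flow fields; here
    they are plain arrays so that the fusion step stands alone.
    """
    fused = (w_spatial * softmax(spatial_logits)
             + (1.0 - w_spatial) * softmax(temporal_logits))
    return int(np.argmax(fused))

# Toy scores for 5 action classes: the streams disagree, fusion decides.
spatial = np.array([2.0, 0.5, 0.1, 0.0, 0.0])   # spatial stream favors class 0
temporal = np.array([0.1, 0.2, 3.0, 0.0, 0.0])  # temporal stream favors class 2
print(fuse_two_stream(spatial, temporal))        # 2 (temporal stream is more confident)
```

Averaging after the softmax, rather than averaging raw logits, keeps the two streams on a comparable scale; TSN-style methods additionally learn per-segment weights instead of the fixed `w_spatial` used here.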
The human skeleton-based action recognition method takes the human skeleton as input. Its advantage is that skeleton data directs more attention to the movement patterns of the human torso and limbs. Spatio-temporal graph convolution over the human skeleton is one such method. Skeleton information greatly reduces the size of the input data; moreover, skeleton data expresses human movement well, eliminating the interference of background information and the impact of camera shake. In 2016, Song et al. proposed an end-to-end spatio-temporal attention model based on skeleton information. Built on a recurrent neural network with LSTM, the model learns to selectively attend to different joints in each frame of the skeleton [19]. Zhu et al. proposed a skeleton-based deep LSTM network that achieves human action recognition by mining joint co-occurrences [20]. Li et al. proposed a multi-task end-to-end joint classification-regression recurrent neural network, which adopts joint classification and regression as the energy function to locate the start and end points of actions automatically and more accurately [21]. In 2018, Yan et al. from the CUHK-SenseTime Joint Laboratory proposed Spatial Temporal Graph Convolutional Networks (ST-GCN), which apply graph convolution to the human skeleton [22]. Since the skeleton graph has a natural topological structure, and graph convolutional networks have achieved good results on such data in recent years, this represents a brand-new idea that performs well for human action recognition. However, using skeleton information alone discards the scene information, making it difficult to distinguish actions with similar movement characteristics.
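The core operation of such skeleton graph convolution methods can be illustrated with a toy example. The sketch below uses an assumed 5-joint chain skeleton rather than the 18-joint OpenPose graph of ST-GCN, and random weights in place of learned ones; only the aggregate-then-transform pattern is the point.

```python
import numpy as np

# One spatial graph convolution step on a toy 5-joint skeleton.
V, C_in, C_out = 5, 3, 8          # joints, input channels (x, y, score), output channels

# Adjacency of a simple chain 0-1-2-3-4 with self-loops (A + I).
A = np.eye(V)
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1

# Symmetric normalization D^{-1/2} (A + I) D^{-1/2}, standard in graph convolution.
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
A_norm = D_inv_sqrt @ A @ D_inv_sqrt

rng = np.random.default_rng(0)
X = rng.standard_normal((V, C_in))      # per-joint features for one frame
W = rng.standard_normal((C_in, C_out))  # weights (learned in practice, random here)

H = A_norm @ X @ W                      # aggregate over neighboring joints, then transform
print(H.shape)                          # (5, 8)
```

ST-GCN itself partitions the neighborhood into root/centripetal/centrifugal subsets with separate weights and adds temporal convolution across frames; this single-matrix form is the simplest special case.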
This paper proposes a human skeleton and scene image-based dual-stream model for action recognition. The motion characteristics of the human limbs are extracted by a spatio-temporal graph convolutional model, and the scene information is extracted by an image convolutional model. To extract scene information better, sparse sampling and video-level supervision strategies are introduced to process the video. The scene and motion information are fused complementarily, which improves the performance of the skeleton-based action recognition model.
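The sparse sampling strategy mentioned above follows the TSN idea of covering the whole video with few frames. A minimal sketch, assuming frames are addressed by index and the video has at least as many frames as segments (function name and defaults are illustrative, not from the paper):

```python
import random

def sparse_sample(num_frames, num_segments=3, train=True):
    """TSN-style sparse sampling: split the frame index range into
    `num_segments` equal segments and pick one frame from each
    (random offset during training, segment center at test time).

    Assumes num_frames >= num_segments; leftover frames at the end
    of the video are simply not sampled in this simplified version.
    """
    seg_len = num_frames // num_segments
    indices = []
    for s in range(num_segments):
        start = s * seg_len
        offset = random.randrange(seg_len) if train else seg_len // 2
        indices.append(start + offset)
    return indices

print(sparse_sample(90, num_segments=3, train=False))  # [15, 45, 75]
```

Because each sampled frame comes from a different segment, the video-level prediction aggregated over the segments supervises the whole clip rather than a single short snippet.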
The remainder of this paper is organized as follows: Section 2 introduces human skeleton gathering and processing. In Section 3, the human skeleton and scene image-based action recognition model is described. Section 4 provides the simulation experimental settings and results. The concluding remarks are given in Section 5.
Human skeleton gathering
The skeleton gathering precedes skeleton-based human action recognition. Skeleton data can be gathered with motion capture devices [23] or extracted from video using pose estimation algorithms. In this paper, the pose estimation algorithm OpenPose is used to extract the human skeleton from the video. OpenPose is an algorithm that detects human body keypoints (neck, shoulders, elbows, etc.), connects them into bones, and then estimates the pose of the human body [24]. As a video preprocessing tool, you
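A typical preprocessing step is packing the per-frame keypoints into a fixed-layout array for the downstream skeleton model. The sketch below assumes OpenPose's flat `pose_keypoints_2d` layout of (x, y, confidence) triplets per joint; the function name and the (C, T, V) ordering are illustrative conventions, not prescribed by the paper.

```python
import numpy as np

def keypoints_to_tensor(frames_keypoints, num_joints=18):
    """Pack per-frame OpenPose keypoints into a (C, T, V) array:
    C = (x, y, confidence) channels, T = frames, V = joints.

    `frames_keypoints` holds one flat [x0, y0, c0, x1, y1, c1, ...]
    list per frame, as in OpenPose's `pose_keypoints_2d` JSON field
    (single-person case; multi-person handling is omitted here).
    """
    T = len(frames_keypoints)
    data = np.zeros((3, T, num_joints))
    for t, flat in enumerate(frames_keypoints):
        joints = np.asarray(flat, dtype=float).reshape(num_joints, 3)
        data[:, t, :] = joints.T
    return data

# Two frames of a toy 2-joint "skeleton".
frames = [[10, 20, 0.9, 30, 40, 0.8],
          [11, 21, 0.9, 31, 41, 0.7]]
print(keypoints_to_tensor(frames, num_joints=2).shape)  # (3, 2, 2)
```

Undetected joints, which OpenPose reports with zero confidence, can then be masked or interpolated on this array before it is fed to the graph convolutional model.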
Human skeleton and scene image-based dual-stream model
The dual-stream structure was inspired by the finding that the human visual cortex contains two pathways: the ventral stream, responsible for object recognition, and the dorsal stream, responsible for motion recognition [6]. Based on this, researchers proposed the dual-stream model for human action recognition in video, whose structure is shown in Fig. 5.
The two branches of the dual-stream model are the spatial and temporal networks. The optical flow method can identify the motion
Dataset
The UCF-101 [34] and HMDB51 datasets are used for verification. UCF-101 is a classic action recognition dataset derived from YouTube, covering 101 action classes. The clips of each class are cut from 25 groups of longer videos, with four to seven clips per group, for a total of 13,320 videos at a resolution of 320 × 240. The videos in UCF-101 are diverse, including camera motion, appearance changes, posture changes, object scale changes, background changes, fiber
Conclusion
Optical flow-based action recognition is ineffective when the scene brightness is unstable. Skeleton-based human action recognition methods possess unique advantages, directing more attention to the human motion characteristics; however, the scene information is lost. Therefore, combining the respective advantages of the skeleton and the dual-stream approach, a scene image and human skeleton-based dual-stream model is constructed to achieve human action recognition. The skeleton
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
We thank Mr. Yan for making the open-source code of ST-GCN available at https://github.com/yysijie/st-gcn and the developers of OpenPose, whose source code is accessible at https://github.com/CMU-Perceptual-Computing-Lab/openpose. This work was supported by the National Key Research and Development Plan of China under Grants 2020AAA0108903 and 2017YFB1300205, and the National Natural Science Foundation of China under Grants 61803227, 61573213, 61603214, and 61673245.
References (58)
- et al., A template matching approach of one-shot-learning gesture recognition, Pattern Recogn. Lett. (2013)
- et al., A review of Convolutional-Neural-Network-based action recognition, Pattern Recogn. Lett. (2019)
- et al., Spatio-temporal silhouette sequence reconstruction for gait recognition against occlusion, IPSJ Trans. Comput. Vision Appl. (2019)
- et al., First Person Action Recognition via Two-stream ConvNet with Long-term Fusion Pooling, Pattern Recogn. Lett. (2018)
- H. Wang, A. Kläser, C. Schmid, C. Liu, Action recognition by dense trajectories, 2011 IEEE Conference on Computer...
- et al., Action Recognition with Improved Trajectories
- et al., Action recognition based on binary patterns of action-history and histogram of oriented gradient, J. Multimodal User In. (2016)
- et al., Two-Stream Convolutional Networks for Action Recognition in Videos
- et al., Action recognition based on statistical analysis from clustered flow vectors, Signal Image Video Process. (2014)
- et al., Beyond short snippets: Deep networks for video classification
- Convolutional Two-Stream Network Fusion for Video Action Recognition
- Deep Local Video Feature for Action Recognition
- Temporal Relational Reasoning in Videos
- 3D Convolutional Neural Networks for Human Action Recognition, IEEE Trans. Pattern Anal. Mach. Intell.
- Learning Spatiotemporal Features with 3D Convolutional Networks
- Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
- Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification, arXiv
- Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks
- An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data
- Co-Occurrence Feature Learning for Skeleton Based Action Recognition Using Regularized Deep LSTM Networks
- Online Human Action Detection Using Joint Classification-Regression Recurrent Neural Networks
- Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition
- A method for sensor-based activity recognition in missing data scenario, Sensors (Switzerland)
- Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields
- An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling, arXiv
- On the Integration of Optical Flow and Action Recognition, CoRR
- Optical flow estimation method under the condition of illumination change, J. Image Graph.
- Temporal Localization of Actions with Actoms, IEEE Trans. Pattern Anal. Mach. Intell.
☆ Editor: Qingyang Xu.