Skeleton edge motion networks for human action recognition
Introduction
Human action recognition is fundamental to a variety of computer vision applications such as video surveillance [1], human–computer interaction [2], and robotics [3]. With the great success of deep learning, recent human action recognition methods usually focus on learning deep spatial-temporal representations from video clips [4], [5], [6]. Recently, the human skeleton, obtained by either a hardware method (e.g., Kinect [7]) or a software method (e.g., human pose estimation [8]), has attracted increasing attention in human action recognition, especially due to the rapid development of human pose estimation algorithms [9], [8]. Though the human skeleton is a concise representation that is robust to complex image backgrounds, learning effective spatial-temporal representations from skeleton sequences remains a challenge [10], [11].
The human skeleton has been widely used in action recognition tasks, in which the coordinates of human body joints are usually organized as joint sequences, a pseudo-image, or a skeleton graph. A variety of deep neural network architectures have then been used to learn effective deep spatial-temporal representations from these input modalities, e.g., recurrent neural networks (RNNs) for joint sequences [12], [13], convolutional neural networks (CNNs) for pseudo-images [14], [15], and graph neural networks (GNNs) for skeleton graphs [11], [16]. Note that skeleton data can easily be combined with heatmaps [15], [17] to improve human action representations. In this paper, we represent the human skeleton as a pseudo-image in order to make full use of CNNs for human action recognition. However, previous CNN-based methods, using only the coordinates of human body joints, usually fail to explore the movement of body parts, i.e., skeleton edge motion [14], [11], [16].
Human body part movement is of great importance for human action recognition using skeleton sequences, yet it is non-trivial to learn skeleton edge representations directly from the coordinates of human body joints using deep neural networks. Inspired by this, we introduce a new skeleton modality, skeleton edge motion, which benefits learning effective representations by exploring the movement of body parts. An intuitive example is shown in Fig. 1, in which the proposed skeleton edge motion modality contains both the rotation angle of a body part and the moving distance of its corresponding body joints. We then concatenate the new skeleton edge motion modality with the original joint coordinates along the channel dimension of the pseudo-image. As shown in Fig. 2(b), the proposed skeleton edge motion modality can be easily extended to other CNN-based methods. Furthermore, considering the structure of the pseudo-image, i.e., each row of the pseudo-image contains all joints in the same video frame (spatial information) and each column contains a specific body joint across all video frames (temporal information), we develop a new spatial-temporal block to learn effective spatial-temporal representations from skeleton pseudo-images. Specifically, the proposed spatial-temporal block has two branches: 1) a spatial branch whose convolutional filters operate across the joints within each frame; and 2) a temporal branch whose convolutional filters operate across frames for each joint. We then devise the proposed skeleton edge motion networks by stacking multiple spatial-temporal blocks, as shown in Fig. 2(c). See more details about the spatial-temporal block in Section 3.2.
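To make the new modality concrete, the following is a minimal sketch of how skeleton edge motion features of this kind could be computed from raw joint coordinates. The array layout, the `edges` list of (parent, child) joint pairs, and the function name `edge_motion` are our own illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def edge_motion(joints, edges):
    """Sketch of a skeleton-edge-motion modality from joint coordinates.

    joints: (T, J, 2) array of 2D joint coordinates over T frames.
    edges:  list of (parent, child) joint-index pairs (body parts).
    Returns (T-1, E, 3): for each consecutive frame pair and each edge,
    the rotation angle of the body part and the moving distances of its
    two endpoint joints.
    """
    T = joints.shape[0]
    out = np.zeros((T - 1, len(edges), 3))
    for t in range(T - 1):
        for e, (i, j) in enumerate(edges):
            v0 = joints[t, j] - joints[t, i]          # edge vector at frame t
            v1 = joints[t + 1, j] - joints[t + 1, i]  # edge vector at frame t+1
            # rotation angle of the body part between consecutive frames
            cos = np.dot(v0, v1) / (np.linalg.norm(v0) * np.linalg.norm(v1) + 1e-8)
            angle = np.arccos(np.clip(cos, -1.0, 1.0))
            # moving distances of the two endpoint joints
            di = np.linalg.norm(joints[t + 1, i] - joints[t, i])
            dj = np.linalg.norm(joints[t + 1, j] - joints[t, j])
            out[t, e] = (angle, di, dj)
    return out
```

The resulting per-edge channels could then be stacked with the original joint coordinates along the channel dimension of the pseudo-image, as described above.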
Temporal order information is crucial for reasoning about relationships within complex actions, and several reasoning structures have been developed for human action recognition [18], [19], [20], [21]. An intuitive example of temporal order information can be derived from a pair of actions such as “standing up” and “sitting down”, in which the most discriminative cue between the two actions is the order of the video frames. An interesting observation is that both deep action recognition models and human observers will misclassify such a pair of actions if all video frames are flipped along the temporal dimension; this problem is also known as “the arrow of time in videos” [18]. Inspired by this, we exploit temporal order information to further boost the performance of the proposed skeleton edge motion networks (SEMN) for human action recognition. Unlike previous works on novel reasoning structures, we propose a self-supervised progressive ranking loss to capture temporal order information in the proposed skeleton edge motion networks. Specifically, the proposed loss function encourages the model to progressively make more confident predictions following the arrow of time in videos: given a set of video frames, let p_t denote the prediction confidence at time step t; we then expect the model to make a more confident prediction at time step t+1, i.e., p_{t+1} ≥ p_t. See more details about the proposed progressive ranking loss in Section 3.3.
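A pairwise constraint of the form p_{t+1} ≥ p_t can be turned into a hinge-style penalty. The sketch below is one plausible form of such a progressive ranking loss, assuming the model's ground-truth-class confidences at successive time steps are available as a vector; the exact formulation in the paper (Section 3.3) may differ:

```python
import numpy as np

def progressive_ranking_loss(confidences, margin=0.0):
    """Hinge-style ranking loss over a sequence of prediction confidences.

    confidences: (S,) confidences of the ground-truth class at S successive
    time steps (e.g., from predictions on progressively longer clips).
    Penalizes any step whose confidence does not exceed the previous one
    by at least `margin`, encouraging p_{t+1} >= p_t along the arrow of time.
    """
    c = np.asarray(confidences, dtype=float)
    gaps = margin + c[:-1] - c[1:]       # constraint violation per pair
    return float(np.maximum(gaps, 0.0).sum())
```

A monotonically increasing confidence sequence incurs zero loss, while any drop in confidence over time is penalized in proportion to its size.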
The remainder of this paper is organized as follows. Section 2 gives a review of related work. Section 3 presents our skeleton edge motion networks. Section 4 demonstrates the experimental results. Section 5 concludes this paper. Our main contributions in this paper can be summarized as follows: 1) we introduce a new skeleton input modality, skeleton edge motion, for human action recognition; 2) we develop the skeleton edge motion networks (SEMN) by stacking multiple spatial-temporal blocks to learn effective deep spatial-temporal representations; and 3) we address temporal order information for human action recognition by further proposing a progressive ranking loss in a self-supervised manner. We evaluate the proposed skeleton edge motion networks on five popular human action recognition datasets, PennAction [22], UTD-MHAD [23], NTU RGB+D [24], NTU RGB+D 120 [25], and CSL [26], and experimental results demonstrate the effectiveness of the proposed method.
Related work
In this section, we first review previous works on human action recognition, especially for methods using human skeleton information. We then discuss the importance of temporal order information for human action recognition.
Our method
In this section, we introduce the proposed skeleton edge motion networks (SEMN) for human action recognition. Specifically, we first formulate the new skeleton edge motion modality. We then introduce the building block of SEMN, i.e., the spatial-temporal block. Lastly, we define the proposed progressive ranking loss in a self-supervised manner.
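As a rough illustration of the two-branch spatial-temporal block described in the Introduction, the sketch below applies separate one-dimensional convolutions along the joint (spatial) and frame (temporal) axes of a single-channel pseudo-image and fuses the two responses by summation. The kernel shapes, the additive fusion, and the function name `st_block` are illustrative assumptions rather than the paper's exact design:

```python
import numpy as np

def st_block(x, k_t, k_s):
    """Two-branch spatial-temporal block sketch on a skeleton pseudo-image.

    x:   (T, J) single-channel pseudo-image (rows: frames, cols: joints).
    k_t: 1D temporal kernel, slid down each column (one joint across frames).
    k_s: 1D spatial kernel, slid along each row (all joints in one frame).
    Returns the element-wise sum of the two branch responses.
    """
    T, J = x.shape
    # temporal branch: convolution along the frame axis, per joint
    temporal = np.stack(
        [np.convolve(x[:, j], k_t, mode="same") for j in range(J)], axis=1)
    # spatial branch: convolution along the joint axis, per frame
    spatial = np.stack(
        [np.convolve(x[t, :], k_s, mode="same") for t in range(T)], axis=0)
    return temporal + spatial  # fuse the two branches
```

Stacking several such blocks (with learned kernels and nonlinearities in the real network) yields a model that mixes spatial and temporal context at every layer.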
Experiments
We evaluate the proposed skeleton edge motion networks on five popular human action recognition datasets, PennAction [22], UTD-MHAD [23], NTU RGB+D [24], NTU RGB+D 120 [25], and CSL [26]. In this section, we first briefly introduce these popular datasets and our experimental setups. We then conduct a number of experiments and compare the proposed skeleton edge motion networks with current state-of-the-art methods. Lastly, we perform ablation studies on the proposed method and discuss possible future directions.
Conclusion
In this paper, we propose a new skeleton edge motion modality containing both the rotation angle and the moving distance. It characterizes the movement of body parts, which are major components of human body topology, to complement the original body joints. The proposed skeleton edge motion modality can be easily combined with joint coordinates to generate the pseudo-image representation. In terms of the structure of these pseudo-images, we develop a spatial-temporal block to learn spatial and temporal representations. Together with the proposed self-supervised progressive ranking loss, which exploits temporal order information, the resulting skeleton edge motion networks demonstrate their effectiveness on five popular human action recognition datasets.
CRediT authorship contribution statement
Haoran Wang: Conceptualization, Methodology, Software, Validation, Investigation, Resources, Visualization, Supervision, Project administration, Funding acquisition. Baosheng Yu: Conceptualization, Methodology, Visualization, Supervision. Kun Xia: Software, Formal analysis, Investigation, Data curation. Jiaqi Li: Software, Formal analysis, Visualization. Xin Zuo: Funding acquisition.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment
This work was supported in part by the National Natural Science Foundation of China (61603080, 61701101, 61871106, 61903164) and the Fundamental Research Funds for the Central Universities of China (N2004022, N182608004). Baosheng Yu was partially supported by Australian Research Council Project FL-170100117.
References (71)
- et al., Dual-layer kernel extreme learning machine for action recognition, Neurocomputing (2017)
- et al., Human action recognition using extreme learning machine based on visual vocabularies, Neurocomputing (2010)
- et al., Combining appearance and structural features for human action recognition, Neurocomputing (2013)
- et al., Weighted feature trajectories and concatenated bag-of-features for action recognition, Neurocomputing (2014)
- et al., Skeleton-based action recognition with extreme learning machines, Neurocomputing (2015)
- et al., Sequence of the most informative joints (SMIJ): A new representation for human skeletal action recognition, Journal of Visual Communication and Image Representation (2014)
- et al., A closer look at spatiotemporal convolutions for action recognition
- et al., MiCT: Mixed 3D/2D convolutional tube for human action recognition
- et al., Body joint guided 3-D deep convolutional descriptors for action recognition, IEEE Transactions on Cybernetics (2017)
- Microsoft Kinect sensor and its effect, IEEE MultiMedia (2012)
- Realtime multi-person 2D pose estimation using part affinity fields
- Actional-structural graph convolutional networks for skeleton-based action recognition
- Skeleton-based action recognition with directed graph neural networks
- Skeleton-based action recognition with spatial reasoning and temporal stack learning
- Pose for action – action for pose
- Recognizing human actions as the evolution of pose estimation maps
- RPAN: An end-to-end recurrent pose-attention network for action recognition in videos
- Learning and using the arrow of time
- Temporal relational reasoning in videos
- ECO: Efficient convolutional network for online video understanding
- Timeception for complex action recognition
- From actemes to action: A strongly-supervised representation for detailed action understanding
- NTU RGB+D: A large scale dataset for 3D human activity analysis
- NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding, IEEE Transactions on Pattern Analysis and Machine Intelligence
- Sign language recognition with long short-term memory
- Spatio-temporal LSTM with trust gates for 3D human action recognition, European Conference on Computer Vision
- Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation
- 2D/3D pose estimation and action recognition using multitask deep learning
- Learning action recognition model from depth and skeleton videos
Haoran Wang received the B.S. degree from the Department of Information Science and Technology, Northeastern University, China, in 2008, and the Ph.D. degree from the School of Automation, Southeast University, China, in 2015. In 2013, he was a visiting scholar at the Department of Computer Science of Temple University, USA. From 2018 to 2019, he was a visiting scholar at the School of Computer Science, University of Sydney. Since March 2015, he has been an assistant professor at Northeastern University, China. His research interests include human action recognition, event detection, and machine learning.
Baosheng Yu received the B.E. degree from the University of Science and Technology of China in 2014 and the Ph.D. degree from the University of Sydney in 2019. He is currently a Research Fellow in the School of Computer Science, Faculty of Engineering, University of Sydney, NSW, Australia. His research interests include bandit learning, deep learning, and computer vision.
Kun Xia received the B.E. degree from School of Electrical Engineering, Shenyang University of Technology, China. He is currently pursuing the M.E. degree at the Department of Information Science and Engineering, Northeastern University, under the supervision of Dr. Haoran Wang. His research interests include human pose estimation, human action recognition, and deep learning.
Jiaqi Li received the B.E. degree from the Department of Information Science and Technology, Northeastern University, China, in 2018. He is currently pursuing the M.E. degree at Northeastern University under the supervision of Dr. Haoran Wang. His research interests include human action recognition, object detection, and deep learning.
Xin Zuo received the B.S. degree in computer science from East China Shipbuilding Institute in 2003 and the M.S. degree in computer science from Jiangsu University of Science and Technology, Zhenjiang, China, in 2007. She received the Ph.D. degree in computer science and engineering from Southeast University, Nanjing, China, in 2014. She is currently an associate professor with the School of Computer Science and Engineering, Jiangsu University of Science and Technology, Zhenjiang, China. Her research interests include image retrieval, image registration, and object detection.