Neurocomputing

Volume 423, 29 January 2021, Pages 1-12
Skeleton edge motion networks for human action recognition

https://doi.org/10.1016/j.neucom.2020.10.037

Abstract

The human skeleton is receiving increasing attention from the human action recognition community due to its robustness to complex image backgrounds. Previous methods usually rely on body joint-based representations, i.e., joint locations, while edge-based movement remains poorly investigated. In this paper, we propose a new human action recognition method, skeleton edge motion networks (SEMN), to further explore the motion information of human body parts. Specifically, we characterize the movement of each skeleton edge by its angle change and the movement of the corresponding body joints. We then construct the proposed skeleton edge motion networks by stacking multiple spatial-temporal blocks to learn a robust deep representation from skeleton sequences. Furthermore, we propose a new progressive ranking loss that helps the proposed skeleton edge motion networks maintain temporal order information in a self-supervised manner. Experimental results on five popular human action recognition datasets, PennAction, UTD-MHAD, NTU RGB+D, NTU RGB+D 120, and CSL, demonstrate the effectiveness of the proposed method.

Introduction

Human action recognition is fundamental in a variety of computer vision applications such as video surveillance [1], human–computer interaction [2], and robotics [3]. With the great success of deep learning, recent human action recognition methods usually focus on learning deep spatial-temporal representations from video clips [4], [5], [6]. Recently, the human skeleton, obtained either by hardware (e.g., Kinect [7]) or by software (e.g., human pose estimation [8]), has attracted more and more attention in the human action recognition task, especially given the rapid development of human pose estimation algorithms [9], [8]. Though the human skeleton is a concise representation that is robust to complex image backgrounds, learning effective spatial-temporal representations from skeleton sequences remains challenging [10], [11].

The human skeleton has been widely used in action recognition tasks, in which the coordinates of human body joints are usually organized as joint sequences, a pseudo-image, or a skeleton graph. A variety of deep neural network architectures have then been used to learn effective deep spatial-temporal representations from these input modalities, e.g., recurrent neural networks (RNNs) for joint sequences [12], [13], convolutional neural networks (CNNs) for pseudo-images [14], [15], and graph neural networks (GNNs) for skeleton graphs [11], [16]. Note that skeleton data can easily be combined with heatmaps [15], [17] to improve human action representations. In this paper, we represent the human skeleton as a pseudo-image to make full use of CNNs for human action recognition. However, previous CNN-based methods, which use only the coordinates of human body joints, usually fail to explore the movement of body parts, i.e., skeleton edge motion [14], [11], [16].
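For intuition, a common way to organize a skeleton sequence into a pseudo-image treats frames as rows, joints as columns, and coordinates as channels; the exact layout and normalization vary between papers, so the sketch below is an illustrative assumption rather than the paper's exact preprocessing:

```python
import numpy as np

def skeleton_to_pseudo_image(sequence):
    """Stack a skeleton sequence into a pseudo-image.

    sequence: array of shape (T, J, C) -- T frames, J body joints,
    C coordinate channels (e.g., x, y[, z]).
    Returns a (T, J, C) "image" a CNN can consume directly: each row
    holds all joints of one frame (spatial information), each column
    tracks one joint across all frames (temporal information).
    """
    seq = np.asarray(sequence, dtype=np.float32)
    # Normalize each coordinate channel to [0, 1] so the pseudo-image
    # behaves like pixel intensities (a common preprocessing choice).
    mins = seq.min(axis=(0, 1), keepdims=True)
    maxs = seq.max(axis=(0, 1), keepdims=True)
    return (seq - mins) / np.maximum(maxs - mins, 1e-6)

# Toy example: 4 frames, 5 joints, 2D coordinates.
pseudo = skeleton_to_pseudo_image(np.random.rand(4, 5, 2))
print(pseudo.shape)  # (4, 5, 2)
```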

Human body part movement is of great importance for human action recognition using skeleton sequences, yet it is non-trivial to learn skeleton edge representations directly from the coordinates of human body joints using deep neural networks. Inspired by this, we introduce a new skeleton modality, skeleton edge motion, which benefits representation learning by explicitly capturing the movement of body parts. An intuitive example is shown in Fig. 1, in which the proposed skeleton edge motion modality contains both the rotation angle Δθ of the body part and the moving distance Δl of its corresponding body joints. We then concatenate the new skeleton edge motion modality with the original joint coordinates along the channel dimension of the pseudo-image. As shown in Fig. 2(b), the proposed skeleton edge motion modality can be easily extended to other CNN-based methods. Furthermore, considering the structure of the pseudo-image, i.e., each row of the pseudo-image contains all joints in the same video frame (spatial information) and each column of the pseudo-image contains a specific body joint across all video frames (temporal information), we develop a new spatial-temporal block to learn effective spatial-temporal representations from skeleton pseudo-images. Specifically, the proposed spatial-temporal block has two branches: 1) a spatial branch with 1×k convolutional filters; and 2) a temporal branch with k×1 convolutional filters. We then devise the proposed skeleton edge motion networks by stacking multiple spatial-temporal blocks, as shown in Fig. 2(c). See more details about the spatial-temporal block in Section 3.2.
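The two quantities Δθ and Δl can be computed per edge between consecutive frames. The following is a minimal sketch under the assumption of 2D joint coordinates; the paper's exact formulation (e.g., angle convention and normalization) may differ:

```python
import numpy as np

def edge_motion(p1_t, p2_t, p1_next, p2_next):
    """Skeleton edge motion between consecutive frames (a sketch).

    p1_*, p2_*: 2D coordinates of the two joints of one skeleton edge
    at frame t and at frame t+1.
    Returns (delta_theta, delta_l1, delta_l2): the rotation angle of
    the edge and the moving distances of its two end joints.
    """
    # Edge orientation at each frame, measured from the x-axis.
    d_t = np.subtract(p2_t, p1_t)
    d_next = np.subtract(p2_next, p1_next)
    theta_t = np.arctan2(d_t[1], d_t[0])
    theta_next = np.arctan2(d_next[1], d_next[0])
    # Wrap the angle difference into (-pi, pi].
    delta_theta = (theta_next - theta_t + np.pi) % (2 * np.pi) - np.pi
    # Moving distance of each end joint.
    delta_l1 = np.linalg.norm(np.subtract(p1_next, p1_t))
    delta_l2 = np.linalg.norm(np.subtract(p2_next, p2_t))
    return delta_theta, delta_l1, delta_l2

# An edge rotating 90 degrees about its fixed first joint.
dth, dl1, dl2 = edge_motion((0, 0), (1, 0), (0, 0), (0, 1))
print(round(float(dth), 4))  # 1.5708 (i.e., pi/2)
```

Concatenating these per-edge values with the joint coordinates along the channel dimension then yields the augmented pseudo-image described above.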

Temporal order information is crucial for reasoning about relationships within complex actions, and several reasoning structures have been developed for human action recognition [18], [19], [20], [21]. An intuitive example of temporal order information can be derived from a pair of actions such as “standing up” and “sitting down”, in which the most discriminative cue between the two actions is the order of the video frames. An interesting observation is that both deep action recognition models and human observers will misclassify such a pair of actions if all video frames are flipped along the temporal dimension; this problem is also known as “the arrow of time in videos” [18]. Inspired by this, we exploit temporal order information to further boost the performance of the proposed skeleton edge motion networks (SEMN) for human action recognition. Unlike previous works that design novel reasoning structures, we propose a self-supervised progressive ranking loss to capture temporal order information in the proposed skeleton edge motion networks. Specifically, the proposed loss function encourages the model to progressively make more confident predictions following the arrow of time in videos: given a sequence of video frames v_1, v_2, ..., v_t, v_{t+1}, ..., let p_t denote the prediction confidence at time step t; we then encourage the model to make a more confident prediction at time step t+1, i.e., p_t ≤ p_{t+1}. See more details about the proposed progressive ranking loss in Section 3.3.
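A minimal sketch of such a ranking constraint, assuming p_t is the predicted probability of the ground-truth class after observing the first t frames (the paper's actual loss may include margins or weighting not shown here):

```python
import numpy as np

def progressive_ranking_loss(confidences, margin=0.0):
    """Hinge-style penalty whenever confidence in the true class
    drops as more frames are observed.

    confidences: sequence of p_t values for t = 1..T, where p_t is
    the predicted probability of the true class after t frames.
    The loss is zero iff p_1 <= p_2 <= ... <= p_T (up to the margin),
    i.e., predictions grow more confident along the arrow of time.
    """
    p = np.asarray(confidences, dtype=np.float64)
    violations = np.maximum(0.0, p[:-1] - p[1:] + margin)
    return float(violations.sum())

print(progressive_ranking_loss([0.2, 0.5, 0.9]))  # 0.0: monotone, no penalty
print(progressive_ranking_loss([0.2, 0.6, 0.4]))  # > 0: confidence dropped
```

No frame-order labels are needed beyond the video itself, which is what makes the objective self-supervised.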

The remainder of this paper is organized as follows. Section 2 gives a review of related work. Section 3 presents our skeleton edge motion networks. Section 4 demonstrates the experimental results. Section 5 concludes this paper. Our main contributions in this paper can be summarized as follows: 1) we introduce a new skeleton input modality, skeleton edge motion, for human action recognition; 2) we develop the skeleton edge motion networks (SEMN) by stacking multiple spatial-temporal blocks to learn effective deep spatial-temporal representations; and 3) we address temporal order information for human action recognition by further proposing a progressive ranking loss in a self-supervised manner. We evaluate the proposed skeleton edge motion networks on five popular human action recognition datasets, PennAction [22], UTD-MHAD [23], NTU RGB+D [24], NTU RGB+D 120 [25], and CSL [26], and experimental results demonstrate the effectiveness of the proposed method.


Related work

In this section, we first review previous works on human action recognition, especially for methods using human skeleton information. We then discuss the importance of temporal order information for human action recognition.

Our method

In this section, we introduce the proposed skeleton edge motion networks (SEMN) for human action recognition. Specifically, we first formulate the new skeleton edge motion modality. We then introduce the building block of SEMN, i.e., the spatial-temporal block. Lastly, we define the proposed progressive ranking loss in a self-supervised manner.
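To make the two-branch design of the spatial-temporal block concrete, here is a numpy-only sketch on a single-channel (T × J) pseudo-image; the actual networks use learned multi-channel convolutions, and the fixed averaging kernel and fusion by summation below are illustrative assumptions:

```python
import numpy as np

def spatial_temporal_block(pseudo_image, k=3):
    """Two-branch block over a (T, J) single-channel pseudo-image:
    rows are frames, columns are joints.

    Spatial branch: 1 x k convolution along each row, mixing the
    joints within one frame. Temporal branch: k x 1 convolution
    along each column, mixing one joint across frames. The two
    branch outputs are fused by summation (one possible choice).
    """
    x = np.asarray(pseudo_image, dtype=np.float32)
    kernel = np.ones(k, dtype=np.float32) / k  # fixed averaging filter

    # 1 x k spatial branch: convolve each row (per-frame joint mixing).
    spatial = np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1, x)
    # k x 1 temporal branch: convolve each column (per-joint dynamics).
    temporal = np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, x)
    return spatial + temporal

out = spatial_temporal_block(np.random.rand(8, 6))
print(out.shape)  # (8, 6)
```

Stacking several such blocks, as the paper describes, lets the network interleave within-frame and across-frame mixing while keeping the pseudo-image resolution.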

Experiments

We evaluate the proposed skeleton edge motion networks on five popular human action recognition datasets, PennAction [22], UTD-MHAD [23], NTU RGB+D [24], NTU RGB+D 120 [25], and CSL [26]. In this section, we first briefly introduce these popular datasets and our experimental setups. We then conduct a number of experiments and compare the proposed skeleton edge motion networks with current state-of-the-art methods. Lastly, we perform ablation studies on the proposed method.

Conclusion

In this paper, we propose a new skeleton edge motion modality containing both the rotation angle and the moving distance. It characterizes the movement of body parts, which are major components of human body topology, to complement the original body joints. The proposed skeleton edge motion modality can be easily combined with joint coordinates to generate the pseudo-image representation. Considering the structure of these pseudo-images, we develop a spatial-temporal block to learn spatial and temporal representations.

CRediT authorship contribution statement

Haoran Wang: Conceptualization, Methodology, Software, Validation, Investigation, Resources, Visualization, Supervision, Project administration, Funding acquisition. Baosheng Yu: Conceptualization, Methodology, Visualization, Supervision. Kun Xia: Software, Formal analysis, Investigation, Data curation. Jiaqi Li: Software, Formal analysis, Visualization. Xin Zuo: Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported in part by the National Natural Science Foundation of China (61603080, 61701101, 61871106, 61903164) and the Fundamental Research Funds for the Central Universities of China (N2004022, N182608004). Baosheng Yu was partially supported by Australian Research Council Project FL-170100117.


References (71)

  • Z. Cao et al.

    Realtime multi-person 2D pose estimation using part affinity fields

  • A. Newell, K. Yang, J. Deng, Stacked hourglass networks for human pose estimation, in: European Conference on Computer...
  • M. Li et al.

    Actional-structural graph convolutional networks for skeleton-based action recognition

  • L. Shi et al.

    Skeleton-based action recognition with directed graph neural networks

  • S. Song, C. Lan, J. Xing, W. Zeng, J. Liu, An end-to-end spatio-temporal attention model for human action recognition...
  • C. Si et al.

    Skeleton-based action recognition with spatial reasoning and temporal stack learning

  • U. Iqbal et al.

    Pose for action-action for pose

  • M. Liu et al.

    Recognizing human actions as the evolution of pose estimation maps

  • L. Shi, Y. Zhang, J. Cheng, H. Lu, Two-stream adaptive graph convolutional networks for skeleton-based action...
  • W. Du et al.

    RPAN: An end-to-end recurrent pose-attention network for action recognition in videos

  • D. Wei et al.

    Learning and using the arrow of time

  • B. Zhou et al.

    Temporal relational reasoning in videos

  • M. Zolfaghari et al.

    ECO: Efficient convolutional network for online video understanding

  • N. Hussein et al.

    Timeception for complex action recognition

  • W. Zhang et al.

    From actemes to action: A strongly-supervised representation for detailed action understanding

  • C. Chen, R. Jafari, N. Kehtarnavaz, UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth...
  • A. Shahroudy et al.

    NTU RGB+D: A large-scale dataset for 3D human activity analysis

  • J. Liu et al.

    NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2019)
  • T. Liu et al.

    Sign language recognition with long short-term memory

  • W. Zhu, C. Lan, J. Xing, W. Zeng, Y. Li, L. Shen, X. Xie, Co-occurrence feature learning for skeleton based action...
  • J. Liu et al.

    Spatio-temporal LSTM with trust gates for 3D human action recognition

    European Conference on Computer Vision

    (2016)
  • C. Li et al.

    Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation

  • D.C. Luvizon et al.

    2D/3D pose estimation and action recognition using multitask deep learning

  • H. Rahmani et al.

    Learning action recognition model from depth and skeleton videos

  • B. Xu, J. Li, Y. Wong, M. S. Kankanhalli, Q. Zhao, Interact as you intend: Intention-driven human-object interaction...

    Haoran Wang received the B.S. degree from the Department of Information Science and Technology, Northeastern University, China, in 2008, and the Ph.D. degree from the School of Automation, Southeast University, China, in 2015. In 2013, he was a visiting scholar at the Department of Computer Science of Temple University, USA. From 2018 to 2019, he was a visiting scholar at the School of Computer Science, University of Sydney. Since March 2015, he has been an assistant professor at Northeastern University, China. His research interests include human action recognition, event detection, and machine learning.

    Baosheng Yu received the B.E. degree from the University of Science and Technology of China in 2014, the Ph.D. degree from the University of Sydney in 2019. He is currently a Research Fellow in the School of Computer Science and the Faculty of Engineering, at the University of Sydney, NSW, Australia. His research interests include bandit learning, deep learning, and computer vision.

    Kun Xia received the B.E. degree from School of Electrical Engineering, Shenyang University of Technology, China. He is currently pursuing the M.E. degree at the Department of Information Science and Engineering, Northeastern University, under the supervision of Dr. Haoran Wang. His research interests include human pose estimation, human action recognition, and deep learning.

    Jiaqi Li received the B.E. degree from the Department of Information Science and Technology, Northeastern University, China, in 2018. He is currently pursuing the M.E. degree at Northeastern University under the supervision of Dr. Haoran Wang. His research interests include human action recognition, object detection, and deep learning.

    Xin Zuo received her B.S. and M.S. degrees in computer science from East China Shipbuilding Institute and Jiangsu University of Science and Technology, Zhenjiang, China, in 2003 and 2007, respectively. She received her Ph.D. degree in computer science and engineering from Southeast University, Nanjing, in 2014. She is currently an associate professor at the School of Computer Science and Engineering, Jiangsu University of Science and Technology, Zhenjiang, China. Her research interests include image retrieval, image registration, and object detection.
