Learning content and style: Joint action recognition and person identification from human skeletons
Introduction
The human visual system can quickly and efficiently detect another living being performing an action in a visual scene and recognize many aspects of biological, psychological, and social significance [1]. Biological motion contains information about actions as well as the identity of persons. Motion patterns can be decomposed into content and style [2], [3]: the content represents the temporal dynamics of body poses, while the style captures the personalized manner in which an action is performed and can be used for person identification. What our visual system seems to solve so effortlessly remains an unsolved problem in computer vision.
Learning content and style corresponds to two important tasks in vision-based human motion understanding, i.e., action recognition and person identification from biological motion. Because their goals differ, existing methods treat them as two separate or even mutually exclusive tasks. Action recognition is concerned with what action is performed, regardless of the human subject; the variation arising from different persons performing the same action in different ways is intra-class variation that has to be reduced. Person identification from biological motion, in contrast, addresses the question of who is performing the action. It seeks distinguishable variations between the same actions performed by different persons, allowing for arbitrary types of actions.
Most previous approaches recognize human actions from videos. Johansson’s experiments [4] showed that a large set of actions can be recognized from the motions of the main skeletal joints, which has inspired much of the literature on human body pose estimation and action recognition. Recently, skeleton based action recognition has gained popularity due to the advent of cost-effective depth sensors (e.g., Microsoft Kinect) and fast, accurate algorithms for estimating skeletons from a single depth image [5]. These depth sensors support real-time, non-invasive pose estimation; the Kinect v2, for instance, can physically sense depth and estimate reliable skeletons at distances of up to 8 m. Human pose estimation from videos is also developing fast, with several popular benchmarks and effective methods. Compared with video data, skeletons are more succinct and explicitly depict the dynamics of actions.
In this paper, we aim to simultaneously recognize both content and style from human movements. We consider RGBD data and learn representations from human skeletons. A novel, unified framework is proposed to conduct action recognition and person identification from human skeletons. The proposed method inherits the merits of deep recurrent neural networks (RNN) for skeleton based action recognition [6], [7], [8]. Fig. 1 shows the architecture of our method: it first learns representations from the raw skeletons and then performs the two tasks together on the shared representations via multi-task learning.
The proposed pipeline consists of two components: skeleton transformation for robust representation and a multi-task RNN for joint learning. The former addresses viewpoint changes and noise through the proposed viewpoint transformation layer and spatial dropout layer, respectively. The latter extends the generic RNN in a multi-task learning manner and comprises shared layers and task-specific layers: the shared RNN layers learn the commonalities across tasks, while the task-specific RNN layers model the differences of the corresponding task. To investigate the effect of different shared and task-specific representations, we enumerate seven architectures with different numbers of shared layers. We also examine two special architectures: one is equivalent to two separate networks and shares no parameters; the other is a novel architecture with no task-specific parameters that learns the joint probability of the two output variables. We apply our model to skeleton based action recognition with cross-view evaluation to compare with existing approaches.
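As a concrete illustration of the skeleton transformation component, the following is a minimal Python/NumPy sketch of a person-centric viewpoint transformation and joint-level spatial dropout. The joint indices, centering on the hip, rotation about the vertical axis, and the dropout rate are illustrative assumptions, not the exact layers used in the paper.

```python
import numpy as np

def viewpoint_transform(skel, hip=0, l_shoulder=4, r_shoulder=8):
    """Map a skeleton sequence to a person-centric coordinate frame.

    skel: (T, J, 3) array of 3D joint positions.
    Joint indices are illustrative placeholders, not the paper's joint layout.
    """
    skel = skel - skel[:, hip:hip + 1, :]           # translate the hip joint to the origin
    v = skel[0, r_shoulder] - skel[0, l_shoulder]   # shoulder vector in the first frame
    theta = np.arctan2(v[2], v[0])                  # rotation angle about the vertical (y) axis
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return skel @ R.T                               # apply the same rotation to every frame

def spatial_dropout(skel, p=0.1, training=True):
    """Randomly zero out whole joints (all 3 coordinates) during training."""
    if not training or p == 0.0:
        return skel
    keep = (np.random.rand(1, skel.shape[1], 1) > p).astype(skel.dtype)
    return skel * keep / (1.0 - p)                  # inverted-dropout rescaling
```

This sketch aligns the first-frame shoulder vector with the x-axis so that sequences captured from different viewpoints share a canonical orientation; the dropout mask removes the same joints across all frames of a sequence.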
In summary, the main contributions of this paper are listed as follows:
- To the best of our knowledge, we are the first to pair action recognition and person identification, inspired by the fact that our visual system can simultaneously recognize content and style from biological motion.
- We propose a new end-to-end trainable pipeline, which consists of skeleton transformation and a multi-task RNN.
- We propose multi-task RNNs with different numbers of shared layers, as well as a novel architecture that learns the joint probability of the two output variables.
- We obtain state-of-the-art results in skeleton based action recognition. Experiments show that, for these two tasks, learning one task benefits from learning the other.
- For person identification, we achieve an accuracy of 65.2% from novel viewpoints among 40 person categories, based solely on skeletons.
Section snippets
Related work
Learning content and style from skeletons is related to a range of topics, e.g., skeleton based action recognition and multi-task learning. Here we briefly review representative work on those topics.
Preliminary
Different from feedforward neural networks, which map one input vector/matrix to one output vector/matrix, recurrent neural networks (RNN) have an internal state that exhibits dynamic temporal behavior. They can process sequences of arbitrary length and map an input sequence to an output sequence. In a simple and popular RNN model, the hidden state representation h_t at each time step t is computed from the input x_t at the current step and the state representation of the previous step: h_t = f(W_xh x_t + W_hh h_{t-1} + b_h), where W_xh and W_hh are the input-to-hidden and hidden-to-hidden weight matrices, b_h is a bias term, and f is a nonlinear activation function such as tanh.
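For concreteness, a minimal NumPy sketch of this standard (Elman-style) recurrence is given below; tanh is used as the nonlinearity f, and the weight shapes are the usual ones rather than any particular configuration from this paper.

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, b_h):
    """Vanilla RNN: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h).

    x_seq: (T, D) input sequence; W_xh: (H, D); W_hh: (H, H); b_h: (H,).
    Returns the (T, H) sequence of hidden states.
    """
    T, H = x_seq.shape[0], W_hh.shape[0]
    h = np.zeros(H)                      # initial hidden state h_0
    states = []
    for x_t in x_seq:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)
```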
Joint learning content and style
For jointly learning content and style from sequences of human skeletons, the learning system observes two supervised learning tasks, i.e., action recognition and person identification. The goal is to address both tasks simultaneously by sharing information between them. The pipeline for learning content and style is shown in Fig. 1. The structure consists of two components: skeleton transformation for robust representation and a multi-task RNN for joint learning.
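The sketch below illustrates one instance of the multi-task RNN component in PyTorch: shared recurrent layers feed two task-specific recurrent branches with separate softmax classifiers, trained with a joint cross-entropy loss. The cell type (LSTM), layer widths, number of shared versus task-specific layers, and equal loss weighting are illustrative assumptions, not the exact configuration evaluated in the paper.

```python
import torch
import torch.nn as nn

class MultiTaskRNN(nn.Module):
    """Shared recurrent layer followed by two task-specific branches.

    Layer sizes and the use of LSTM cells are illustrative assumptions.
    """
    def __init__(self, in_dim, hidden=128, n_actions=60, n_persons=40):
        super().__init__()
        self.shared = nn.LSTM(in_dim, hidden, batch_first=True)        # shared across tasks
        self.action_branch = nn.LSTM(hidden, hidden, batch_first=True) # task-specific layers
        self.person_branch = nn.LSTM(hidden, hidden, batch_first=True)
        self.action_head = nn.Linear(hidden, n_actions)
        self.person_head = nn.Linear(hidden, n_persons)

    def forward(self, x):                 # x: (B, T, in_dim) transformed skeleton sequences
        shared, _ = self.shared(x)        # representation common to both tasks
        a, _ = self.action_branch(shared)
        p, _ = self.person_branch(shared)
        # classify from the last time step of each task-specific branch
        return self.action_head(a[:, -1]), self.person_head(p[:, -1])

def joint_loss(action_logits, person_logits, action_y, person_y):
    """Sum of per-task cross-entropy losses (equal weights assumed)."""
    ce = nn.CrossEntropyLoss()
    return ce(action_logits, action_y) + ce(person_logits, person_y)
```

Varying how many recurrent layers are placed before versus after the split corresponds to the different architectures with different numbers of shared layers discussed above.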
Experiments and analysis
In this section, we first describe the datasets and the implementation details, including the experimental setup. We then compare the results of different structures and analyze the distinctions between different actions. Our action recognition results are also compared with the previous state-of-the-art results. Finally, we analyze different training methods and evaluate the parameters to draw further insights into the proposed model.
Conclusion and future work
In this paper, we present an end-to-end RNN architecture based on multi-task learning to simultaneously conduct action recognition and person identification. The structure consists of two components: skeleton transformation and a multi-task RNN. For skeleton transformation, viewpoint transformation and spatial dropout are utilized to learn robust representations. For the multi-task RNN, architectures with different numbers of shared layers are investigated. We apply the proposed model to skeleton based action recognition with cross-view evaluation and to person identification from novel viewpoints.
Acknowledgment
This work is jointly supported by National Key Research and Development Program of China (2016YFB1001000), National Natural Science Foundation of China (61525306, 61633021, 61721004 and 61420106015), Capital Science and Technology Leading Talent Training Project (Z181100006318030) and Beijing Natural Science Foundation (4162058).
References (72)
- A deep structure for human pose estimation, Signal Process. (2015)
- Learning discriminative trajectorylet detector sets for accurate skeleton-based action recognition, Pattern Recognit. (2017)
- Motion analysis: action detection, recognition and evaluation based on motion capture data, Pattern Recognit. (2018)
- DSRF: a flexible trajectory descriptor for articulated human action recognition, Pattern Recognit. (2018)
- Tensor-based linear dynamical systems for action recognition from 3D skeletons, Pattern Recognit. (2018)
- RGB-D-based action recognition datasets: a survey, Pattern Recognit. (2016)
- Sequence of the most informative joints (SMIJ): a new representation for human skeletal action recognition, J. Vis. Commun. Image Represent. (2014)
- Human gait recognition based on deterministic learning through multiple views fusion, Pattern Recognit. Lett. (2016)
- Hierarchical learning of multi-task sparse metrics for large-scale image classification, Pattern Recognit. (2017)
- A unified architecture for natural language processing: deep neural networks with multitask learning, Proceedings of the 2008 International Conference on Machine Learning (2008)
- Improving neural networks by preventing co-adaptation of feature detectors, CoRR
- Decomposing biological motion: a framework for analysis and synthesis of human gait patterns, J. Vis.
- Separating style and content with bilinear models, Neural Comput.
- Gait style and gait content: bilinear models for gait recognition using gait re-sampling, Proceedings of the 2004 IEEE Conference on Automatic Face and Gesture Recognition
- Visual perception of biological motion and a model for its analysis, Percept. Psychophys.
- Real-time human pose recognition in parts from single depth images, Commun. ACM
- Hierarchical recurrent neural network for skeleton based action recognition, Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition
- Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks, Proceedings of the 2015 AAAI
- NTU RGB+D: a large scale dataset for 3D human activity analysis, Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition
- Tracking human pose using max-margin Markov models, IEEE Trans. Image Process.
- Learning a tracking and estimation integrated graphical model for human pose tracking, IEEE Trans. Neural Netw. Learn. Syst.
- 3D skeleton-based human action classification: a survey, Pattern Recognit.
- Human action recognition using a temporal hierarchy of covariance descriptors on 3D joint locations, Proceedings of the 2013 International Joint Conference on Artificial Intelligence
- View invariant human action recognition using histograms of 3D joints, Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition Workshops
- Mining actionlet ensemble for action recognition with depth cameras, Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition
- Eigenjoints-based action recognition using Naive-Bayes-nearest-neighbor, Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition Workshops
- Parameterized modeling and recognition of activities, Proceedings of the 1998 IEEE International Conference on Computer Vision
- Joint angles similarities and HOG2 for action recognition, Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops
- Representation learning of temporal dynamics for skeleton-based action recognition, IEEE Trans. Image Process.
- Differential recurrent neural networks for action recognition, Proceedings of the 2015 IEEE International Conference on Computer Vision
- Spatio-temporal LSTM with trust gates for 3D human action recognition, Proceedings of the 2016 European Conference on Computer Vision
- An end-to-end spatio-temporal attention model for human action recognition from skeleton data, Proceedings of the 2017 AAAI
- Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks, Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition
- Silhouette analysis-based gait recognition for human identification, IEEE Trans. Pattern Anal. Mach. Intell.
- Gait recognition using static, activity-specific parameters, Proceedings of the 2001 IEEE Conference on Computer Vision and Pattern Recognition
- Fusion of static and dynamic body biometrics for gait recognition, IEEE Trans. Circuits Syst. Video Technol.
Hongsong Wang received the B.S. degree in automation from Huazhong University of Science and Technology in 2013. He is currently pursuing the Ph.D. degree at the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing, China. His research interests include action recognition, video classification, and deep learning.
Liang Wang (SM'09) received the B.S. and M.S. degrees from Anhui University in 1997 and 2000, respectively, and the Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences (CASIA), in 2004. From 2004 to 2010, he worked as a Research Assistant at Imperial College London, United Kingdom, and Monash University, Australia, as a Research Fellow at the University of Melbourne, Australia, and as a Lecturer at the University of Bath, United Kingdom. He is currently a full Professor of the Hundred Talents Program at the National Lab of Pattern Recognition, CASIA. His major research interests include machine learning, pattern recognition, and computer vision. He has published widely in highly ranked international journals such as IEEE TPAMI and IEEE TIP and at leading international conferences such as CVPR, ICCV, and ICDM. He is an associate editor of IEEE Transactions on SMC-B, an IAPR Fellow, and a Senior Member of the IEEE.