Pattern Recognition

Volume 81, September 2018, Pages 23-35

Learning content and style: Joint action recognition and person identification from human skeletons

https://doi.org/10.1016/j.patcog.2018.03.030

Highlights

  • We are the first to pair action recognition and person identification, imitating the ability of the human visual system.

  • We propose a new end-to-end trainable pipeline, which consists of skeleton transformation and multi-task RNN.

  • We propose several novel multi-task RNN architectures with different numbers of shared layers.

  • Experiments show that, for these two tasks, learning one task benefits from learning the other.

Abstract

Humans are able to simultaneously identify a person and recognize his or her action from biological motion. Previous work usually treats action recognition and person identification from motion as two separate tasks with different objectives. In this paper, we present an end-to-end framework that performs the two tasks together. Inspired by the recent success of deep recurrent neural networks (RNN) for skeleton based action recognition, we propose a new pipeline to recognize both actions and persons from skeletons extracted by RGBD sensors. The structure consists of two end-to-end trainable subnets. The former, skeleton transformation, accommodates viewpoint changes and noise. The latter, multi-task RNN, performs joint learning; various architectures are explored, including a novel one that learns the joint probability of the two output variables. Experiments on 3D action recognition benchmark datasets demonstrate the benefits of multi-task learning, and our method dramatically outperforms the existing state of the art in action recognition.

Introduction

The human visual system can quickly and efficiently detect another living being performing some action in a visual scene and recognize many aspects of biological, psychological, and social significance [1]. Biological motion carries information about actions as well as the identity of persons. Motion patterns can be decomposed into content and style [2], [3]: the content represents the temporal dynamics of body poses, and the style reflects the personalized manner of performing actions, which can be used for person identification. What our visual system seems to solve so effortlessly remains an unsolved problem in computer vision.

Learning content and style corresponds to two important tasks for vision based human motion understanding, i.e., action recognition and person identification from biological motion. Because their goals differ, existing methods treat them as two separate or even mutually exclusive tasks. Action recognition is concerned with what action is performed, regardless of the human subject; the variation introduced by different persons performing the same action in different ways is intra-class variation that has to be suppressed. Person identification from biological motion, in contrast, addresses who is performing the action; it seeks distinguishable variations between the same actions performed by different persons, regardless of the type of action.

Most previous approaches recognize human actions from videos. Johansson’s experiments [4] showed that a large set of actions can be recognized from the motions of the main skeleton joints, which has inspired much of the literature on human body pose estimation and action recognition. Recently, skeleton based action recognition has gained popularity due to the advent of cost-effective depth sensors (e.g., Microsoft Kinect) and fast, accurate algorithms for skeleton estimation from a single depth image [5]. These depth sensors support real-time non-invasive pose estimation; the Kinect v2 can currently sense depth and estimate reliable skeletons at up to 8 m. Human pose estimation in videos is also developing fast, with several popular benchmarks and effective methods. Compared with video data, skeletons are more succinct and explicitly depict the dynamics of actions.

In this paper, we aim to simultaneously recognize both content and style from human movements. We consider RGBD data and learn representations from human skeletons. A novel, unified framework is proposed to conduct action recognition and person identification from human skeletons. The proposed method inherits the merits of deep recurrent neural networks (RNN) for skeleton based action recognition [6], [7], [8]. Fig. 1 shows the architecture of our method: it first learns representations from the raw skeletons and then performs the two tasks together using shared representations based on multi-task learning.

The proposed pipeline consists of two components: skeleton transformation for robust representation and multi-task RNN for joint learning. The former addresses viewpoint changes and noise with the proposed viewpoint transformation layer and spatial dropout layer, respectively. The latter extends the generic RNN in a multi-task learning manner and comprises shared layers and task-specific layers: the shared RNN layers learn the commonalities across tasks, and the task-specific RNN layers model the differences for the corresponding task. To investigate the ability of different shared and task-specific representations, we enumerate seven architectures with different numbers of shared layers. We also examine two special architectures: one is equivalent to two separate networks and has no shared parameters; the other is a novel architecture with no task-specific parameters, which learns the joint probability of the two output variables. We apply our model to skeleton based action recognition with cross-view evaluation to compare with existing approaches.
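To make this sharing spectrum concrete, the sketch below (in PyTorch) shows one point on it: a single shared LSTM layer followed by one task-specific LSTM layer and classifier per task. The layer sizes, class counts, and the position of the shared/task-specific split are illustrative assumptions, and the skeleton-transformation subnet is omitted; this is a minimal sketch, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiTaskRNN(nn.Module):
    """Minimal sketch: a shared LSTM layer learns commonalities across
    tasks; per-task LSTM layers and classifiers model the differences."""
    def __init__(self, in_dim=75, hid=128, n_actions=60, n_persons=40):
        super().__init__()
        self.shared = nn.LSTM(in_dim, hid, batch_first=True)
        self.action_rnn = nn.LSTM(hid, hid, batch_first=True)  # content branch
        self.person_rnn = nn.LSTM(hid, hid, batch_first=True)  # style branch
        self.action_fc = nn.Linear(hid, n_actions)
        self.person_fc = nn.Linear(hid, n_persons)

    def forward(self, x):                # x: (batch, time, in_dim) skeletons
        shared_out, _ = self.shared(x)
        a, _ = self.action_rnn(shared_out)
        p, _ = self.person_rnn(shared_out)
        # Classify from the last time step of each task-specific branch.
        return self.action_fc(a[:, -1]), self.person_fc(p[:, -1])
```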

In summary, the main contributions of this paper are listed as follows:

  • To the best of our knowledge, we are the first to pair action recognition and person identification, inspired by the fact that the human visual system can simultaneously recognize content and style from biological motion.

  • We propose a new end-to-end trainable pipeline, which consists of skeleton transformation and multi-task RNN.

  • We propose multi-task RNNs with different numbers of shared layers, as well as a novel architecture that learns the joint probability of the two output variables (see the sketch after this list).

  • We obtain state-of-the-art results in skeleton based action recognition. Experiments show that, for these two tasks, learning one task benefits from learning the other.

  • For person identification, we achieve an accuracy of 65.2% from novel viewpoints over 40 subject categories, based solely on skeletons.
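The architecture with no task-specific parameters mentioned above can be read as a single classifier over all (action, person) pairs. The following PyTorch sketch is a hypothetical rendering of that idea rather than the paper's exact layer: it outputs a joint distribution over the product label space, from which the two marginal distributions are recovered by summation; all layer sizes and class counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class JointHead(nn.Module):
    """Hypothetical sketch: one softmax over all (action, person) pairs
    instead of two separate task-specific classifiers."""
    def __init__(self, hid=128, n_actions=60, n_persons=40):
        super().__init__()
        self.n_actions, self.n_persons = n_actions, n_persons
        self.fc = nn.Linear(hid, n_actions * n_persons)

    def forward(self, h):                  # h: (batch, hid) sequence feature
        joint = self.fc(h).softmax(dim=1)  # P(action, person | sequence)
        joint = joint.view(-1, self.n_actions, self.n_persons)
        p_action = joint.sum(dim=2)        # marginalize out the person
        p_person = joint.sum(dim=1)        # marginalize out the action
        return joint, p_action, p_person
```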

Section snippets

Related work

Learning content and style from skeletons is related to a range of topics, e.g., skeleton based action recognition and multi-task learning. Here we briefly review representative work on those topics.

Preliminary

Different from feedforward neural networks, which map one input vector/matrix to one output vector/matrix, recurrent neural networks (RNN) maintain an internal state and thus exhibit dynamic temporal behavior. They can process arbitrary sequences and map an input sequence to another output sequence. In a simple and popular RNN model, the hidden state representation $h_t$ at each time step $t$ is computed from the input $x_t$ at the current step and the state representation $h_{t-1}$ of the previous step: $h_t = \sigma(W_x x_t + W_h h_{t-1} + b)$, where $\sigma$ is a nonlinear activation function and $W_x$, $W_h$, and $b$ are learnable parameters.
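For concreteness, the recurrence can be implemented in a few lines of Python. The dimensions below (a 75-D input vector, i.e., 25 joints with 3 coordinates each, and a 128-D hidden state) are illustrative assumptions rather than settings from the paper:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One step of the simple RNN above: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

# Illustrative sizes: 25 joints x 3 coordinates = 75-D input, 128-D state.
rng = np.random.default_rng(0)
in_dim, hid, T = 75, 128, 10
W_x = rng.normal(scale=0.01, size=(hid, in_dim))
W_h = rng.normal(scale=0.01, size=(hid, hid))
b = np.zeros(hid)

h = np.zeros(hid)                         # initial hidden state
for x_t in rng.normal(size=(T, in_dim)):  # a dummy skeleton sequence
    h = rnn_step(x_t, h, W_x, W_h, b)     # the state carries temporal context
```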

Joint learning content and style

For jointly learning content and style from sequences of human skeletons, the learning system observes two supervised learning tasks, i.e., action recognition and person identification. The goal is to simultaneously address both tasks by sharing information between them. The pipeline of learning content and style is shown in Fig. 1. The structure consists of two components: skeleton transformation for robust representation and multi-task RNN for joint learning.
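A common way to realize such joint supervision, and presumably what a multi-task pipeline of this kind optimizes, is a weighted sum of per-task classification losses. The sketch below assumes cross-entropy for both tasks and an illustrative weight lam; neither choice is specified in this snippet. The two logit tensors stand in for the outputs of the task-specific branches:

```python
import torch
import torch.nn.functional as F

def joint_loss(action_logits, person_logits, action_y, person_y, lam=1.0):
    """Assumed objective: weighted sum of the two cross-entropy losses."""
    return (F.cross_entropy(action_logits, action_y)
            + lam * F.cross_entropy(person_logits, person_y))

# Dummy batch of 8 samples, 60 action classes, 40 person identities.
action_logits = torch.randn(8, 60, requires_grad=True)
person_logits = torch.randn(8, 40, requires_grad=True)
loss = joint_loss(action_logits, person_logits,
                  torch.randint(0, 60, (8,)), torch.randint(0, 40, (8,)))
loss.backward()  # gradients flow back into both branches
```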

Experiments and analysis

In this section, we first describe the datasets and the implementation details, including the experimental setup. Then, we compare the results of different structures and analyze the distinctions between different actions. Our action recognition results are also compared with the previous state-of-the-art results. Finally, we analyze different training methods and evaluate the parameters to gain further insight into the proposed model.

Conclusion and future work

In this paper, we present an end-to-end RNN architecture based on multi-task learning to simultaneously conduct action recognition and person identification. The structure consists of two components: skeleton transformation and multi-task RNN. For skeleton transformation, viewpoint transformation and spatial dropout are utilized to learn robust representations. For multi-task RNN, architectures with different numbers of shared layers are investigated. We apply the proposed model to skeleton based action recognition with cross-view evaluation and obtain state-of-the-art results.

Acknowledgment

This work is jointly supported by National Key Research and Development Program of China (2016YFB1001000), National Natural Science Foundation of China (61525306, 61633021, 61721004 and 61420106015), Capital Science and Technology Leading Talent Training Project (Z181100006318030) and Beijing Natural Science Foundation (4162058).

References (72)

  • G.E. Hinton et al.

    Improving neural networks by preventing co-adaptation of feature detectors

    CoRR

    (2012)
  • N.F. Troje

    Decomposing biological motion: a framework for analysis and synthesis of human gait patterns

    J. Vis.

    (2002)
  • J.B. Tenenbaum et al.

    Separating style and content with bilinear models

    Neural Comput.

    (2000)
  • C.-S. Lee et al.

    Gait style and gait content: bilinear models for gait recognition using gait re-sampling

    Proceedings of the 2004 IEEE Conference on Automatic Face and Gesture Recognition

    (2004)
  • G. Johansson

    Visual perception of biological motion and a model for its analysis

    Percept. Psychophys.

    (1973)
  • J. Shotton et al.

    Real-time human pose recognition in parts from single depth images

    Commun. ACM

    (2013)
  • Y. Du et al.

    Hierarchical recurrent neural network for skeleton based action recognition

    Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition

    (2015)
  • W. Zhu et al.

    Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks

    Proceedings of the 2016 AAAI

    (2016)
  • A. Shahroudy et al.

    NTU RGB+D: a large scale dataset for 3D human activity analysis

    Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition

    (2016)
  • L. Zhao et al.

    Tracking human pose using max-margin Markov models

    IEEE Trans. Image Process.

    (2015)
  • L. Zhao et al.

    Learning a tracking and estimation integrated graphical model for human pose tracking

    IEEE Trans. Neural Netw. Learn. Syst.

    (2015)
  • L.L. Presti et al.

    3D skeleton-based human action classification: a survey

    Pattern Recognit.

    (2016)
  • M.E. Hussein et al.

    Human action recognition using a temporal hierarchy of covariance descriptors on 3D joint locations

    Proceedings of the 2013 International Joint Conference on Artificial Intelligence

    (2013)
  • L. Xia et al.

    View invariant human action recognition using histograms of 3D joints

    Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition Workshops

    (2012)
  • J. Wang et al.

    Mining actionlet ensemble for action recognition with depth cameras

    Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition

    (2012)
  • X. Yang et al.

    Eigenjoints-based action recognition using Naive-Bayes-nearest-neighbor

    Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition Workshops

    (2012)
  • Y. Yacoob et al.

    Parameterized modeling and recognition of activities

    Proceedings of the 1998 IEEE International Conference on Computer Vision

    (1998)
  • E. Ohn-Bar et al.

    Joint angles similarities and HOG2 for action recognition

    Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops

    (2013)
  • Y. Du et al.

    Representation learning of temporal dynamics for skeleton-based action recognition

    IEEE Trans. Image Process.

    (2016)
  • V. Veeriah et al.

    Differential recurrent neural networks for action recognition

    Proceedings of the 2015 IEEE International Conference on Computer Vision

    (2015)
  • J. Liu et al.

    Spatio-temporal LSTM with trust gates for 3D human action recognition

    Proceedings of the 2016 European Conference on Computer Vision

    (2016)
  • S. Song et al.

    An end-to-end spatio-temporal attention model for human action recognition from skeleton data

    Proceedings of the 2017 AAAI

    (2017)
  • H. Wang et al.

    Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks

    Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition

    (2017)
  • L. Wang et al.

    Silhouette analysis-based gait recognition for human identification

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2003)
  • A.F. Bobick et al.

    Gait recognition using static, activity-specific parameters

    Proceedings of the 2001 IEEE Conference on Computer Vision and Pattern Recognition

    (2001)
  • L. Wang et al.

    Fusion of static and dynamic body biometrics for gait recognition

    IEEE Trans. Circuits Syst. Video Technol.

    (2004)

    Hongsong Wang received the B.S. degree in automation from Huazhong University of Science and Technology in 2013. He is currently pursuing the Ph.D. degree at the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing, China. His research interests include action recognition, video classification, and deep learning.

    Liang Wang (SM'09) received both the B.S. and M.S. degrees from Anhui University in 1997 and 2000, respectively, and the Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences (CASIA), in 2004. From 2004 to 2010, he worked as a Research Assistant at Imperial College London, United Kingdom, and Monash University, Australia, a Research Fellow at the University of Melbourne, Australia, and a Lecturer at the University of Bath, United Kingdom. Currently, he is a full Professor under the Hundred Talents Program at the National Lab of Pattern Recognition, CASIA. His major research interests include machine learning, pattern recognition, and computer vision. He has published widely in highly ranked international journals such as IEEE TPAMI and IEEE TIP, and at leading international conferences such as CVPR, ICCV, and ICDM. He is an associate editor of IEEE Transactions on SMC-B. He is currently an IAPR Fellow and a Senior Member of the IEEE.
