Elsevier

Pattern Recognition Letters

Volume 126, 1 September 2019, Pages 132-138
TGLSTM: A time based graph deep learning approach to gait recognition

https://doi.org/10.1016/j.patrec.2018.05.004

Abstract

We address the problem of gait recognition using a robust graph-based deep learning model. The proposed approach, named Time based Graph Long Short-Term Memory (TGLSTM) network, is able to dynamically learn graphs that change over time, as in gait and action recognition. Indeed, the TGLSTM model jointly exploits structured data and temporal information through a deep neural network able to learn long short-term dependencies together with graph structure. Experiments were conducted on popular datasets for action and gait recognition (MSR Action 3D, CAD-60, CASIA Gait B, and "TUM Gait from Audio, Image and Depth" (TUM-GAID)), investigating the advantages of TGLSTM with respect to state-of-the-art methods.

Introduction

Using gait as a biometric is a relatively new area of study. Gait is defined as "a particular way or manner of moving on foot". It has been receiving growing interest within the computer vision community, and a number of gait metrics have been developed. Compared to other biometrics, gait has some unique characteristics [1]. The most attractive feature of gait as a biometric trait is its unobtrusiveness, i.e., the fact that, unlike other biometrics, it can be captured at a distance and without requiring the prior consent of the observed subject. Gait also has the advantage of being difficult to hide, steal, or fake. Nevertheless, gait recognition is still a very challenging problem: it relies on video sequences taken in controlled or uncontrolled environments; it is not invariant to the capturing viewpoint; and it changes over time and is affected by clothes, footwear, walking surface, walking speed, and emotional condition [2].

Among other issues, a critical step in gait recognition is feature extraction, i.e., the extraction, from video sequences showing a walking person, of signals that can be used for recognition. This step is very important since there are numerous conceivable ways to extract signals from a gait video sequence, e.g., spatial, temporal, spatio-temporal, and frequency-domain feature extraction. Therefore, one must ensure that the feature extraction process captures as much discriminatory information as possible.

In this paper, we propose to extract information from unstructured data such as video frames, generating a structured description in terms of skeletons of the persons and basic human actions present in the scene.

We have previously adopted skeletons with success for object recognition [3], as well as for human activity recognition [4], video tracking [5], and person re-identification [6].

Among others, different skeleton-based approaches have recently been reported for action recognition. Wang et al. [7] divide the skeleton into five components and use spatial and temporal dictionaries of the components to represent actions; Vemulapalli et al. [8] use rotations and translations to represent the 3D geometric relationships of body components in a Lie group, and then employ Dynamic Time Warping (DTW) and Fourier Temporal Pyramid (FTP) to model the temporal dynamics; while GRUNTS [9] uses the structured features of the skeleton for the temporal segmentation of human actions.

Other approaches use skeletal joint features to learn a supervised model. For example, Wu and Shao [10] adopt a deep forward neural network to estimate the emission probabilities of the hidden states in an HMM. Other approaches employ Recurrent Neural Networks (RNNs) to directly classify sequences without any segmentation: Grushin et al. [11] use an LSTM-RNN for robust action recognition and achieve good results on the KTH dataset; Baccouche et al. [12] propose an LSTM-RNN that recognizes actions taking histograms of optical flow as inputs. The LSTM-RNNs employed in the last two cited approaches are both unidirectional with only one hidden layer, while Lefebvre et al. [13] propose a bidirectional LSTM-RNN with one forward hidden layer and one backward hidden layer for gesture classification. Du et al. [14], considering that human actions are composed of the motions of human body components, report a method that uses RNNs in a hierarchical way.

To highlight the inner structure in the data, we focus on processing streams of structured data with recursive neural networks [15], in which the temporal processing of recurrent neural networks is extended to graphs by connectionist models, and where prior knowledge can be inserted to facilitate the interpretation of learnt structured data [16], [17].

Our method, following the ideas proposed in [18], [19], is based on a deep RNN that learns not only the skeletal joint features but also the information extracted from the changes in adjacency matrices over time. Indeed, graphs are almost always dynamic: they change shape and size as time passes. Thus, an important part of the richness and complexity of a graph is how it changes through time; new connections are formed and old ones are broken all the time. To the best of our knowledge, this is the first time this feature has been considered for the problem of gait analysis, and we are confident it increases accuracy and robustness.
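As an illustration of the kind of temporal graph-change information that can be fed to such a model, the sketch below computes, for each pair of consecutive frames, how many edges were added, how many were removed, and the current edge count from a sequence of adjacency matrices. This is a hypothetical minimal example of the idea, not the authors' exact feature set:

```python
import numpy as np

def adjacency_change_features(adjs):
    """Given a list of symmetric (N x N) 0/1 adjacency matrices over time,
    return per-frame tuples (edges_added, edges_removed, edge_count)
    describing how the graph changes between consecutive frames.
    Illustrative sketch; the paper's actual features may differ."""
    feats = []
    prev = adjs[0]
    for adj in adjs[1:]:
        # each undirected edge appears twice in a symmetric matrix
        added = int(np.sum((adj == 1) & (prev == 0)) // 2)
        removed = int(np.sum((adj == 0) & (prev == 1)) // 2)
        feats.append((added, removed, int(adj.sum() // 2)))
        prev = adj
    return feats
```

Such per-frame change descriptors can then be concatenated with the skeletal joint features before being passed to the recurrent layers.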

The model we report is named Time based Graph Long Short-Term Memory (TGLSTM); it jointly exploits structured data and temporal information through a deep neural network model able to learn long short-term dependencies together with graph structure. We demonstrate the advantage of the proposed method in both action and gait recognition, investigating the advantages of TGLSTM with respect to state-of-the-art methods on popular datasets: MSR Action 3D, CAD-60, CASIA Gait B, and "TUM Gait from Audio, Image and Depth" (TUM-GAID).

The rest of the paper is structured as follows: Section 2 presents the structured feature extraction used in the learning step; in Section 3 we describe the proposed Time based Graph Long Short-Term Memory (TGLSTM) model; in Section 4, we show the robustness of the TGLSTM in action recognition and gait recognition. Lastly, some conclusions are drawn along with the conducted evaluations.

Graph representation

The representation of the meaningful features captured at each frame has an important role in our approach. The steps to construct a graph from each frame are the following:

  1. Each frame is segmented over time to extract the foreground as the moving object against a learnt background model;

  2. A skeleton is constructed for each foreground;

  3. Each skeleton is polygonally approximated to extract features, which are attached to the graph/skeleton edges.

Details about each step are provided in the following.
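The polygonal approximation in step 3 is commonly performed with the Ramer-Douglas-Peucker algorithm; the snippet above does not specify the exact procedure, so the sketch below is one plausible implementation that reduces a skeleton branch (a polyline of pixel coordinates) to its salient vertices:

```python
import numpy as np

def rdp(points, eps):
    """Ramer-Douglas-Peucker polygonal approximation of a polyline.
    points: sequence of (x, y) coordinates along one skeleton branch;
    eps: maximum allowed perpendicular deviation from the chord.
    Returns the simplified polyline as a list of [x, y] vertices."""
    pts = np.asarray(points, dtype=float)
    if len(pts) < 3:
        return pts.tolist()
    start, end = pts[0], pts[-1]
    v = end - start
    # perpendicular distance of every point from the start-end chord
    d = np.abs(v[0] * (pts[:, 1] - start[1]) - v[1] * (pts[:, 0] - start[0]))
    d = d / (np.linalg.norm(v) + 1e-12)
    i = int(np.argmax(d))
    if d[i] > eps:
        # split at the farthest point and simplify each half recursively
        left = rdp(pts[: i + 1], eps)
        right = rdp(pts[i:], eps)
        return left[:-1] + right
    return [start.tolist(), end.tolist()]
```

The surviving vertices become graph nodes, and features (e.g., segment length and orientation) can be attached to the edges between them.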

The TGLSTM model

The Time based Graph LSTM (TGLSTM) model has been designed on the basis of recurrent neural network concepts (Fig. 2). The network is composed of LSTM layers alternating with Fully Connected layers: a frame-based skeleton graph is fed as input, and the action represented by the graph is provided as output.

Unlike other RNN models that work on whole action sequences, the model operates in a frame-by-frame manner.
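To make the alternating structure concrete, here is a minimal, untrained NumPy sketch of a frame-by-frame model with LSTM layers alternating with fully connected layers. Layer sizes, initialization, and the single-cell formulation are illustrative assumptions, not the authors' actual architecture:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """Single LSTM cell with the four gates stacked in one weight matrix."""
    def __init__(self, n_in, n_hid, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(4 * n_hid, n_in + n_hid))
        self.b = np.zeros(4 * n_hid)
        self.n_hid = n_hid

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, o, g = np.split(z, 4)
        i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
        c = f * c + i * g          # update cell state
        h = o * np.tanh(c)         # emit hidden state
        return h, c

class TGLSTMSketch:
    """Toy frame-by-frame model: LSTM -> fully connected -> LSTM ->
    fully connected, mirroring the alternating structure described in
    the text (random weights, for shape illustration only)."""
    def __init__(self, n_feat, n_hid, n_classes, seed=0):
        rng = np.random.default_rng(seed)
        self.lstm1 = LSTMCell(n_feat, n_hid, seed)
        self.fc1 = rng.normal(scale=0.1, size=(n_hid, n_hid))
        self.lstm2 = LSTMCell(n_hid, n_hid, seed + 1)
        self.fc2 = rng.normal(scale=0.1, size=(n_classes, n_hid))

    def forward(self, frames):
        n = self.lstm1.n_hid
        h1, c1, h2, c2 = (np.zeros(n) for _ in range(4))
        for x in frames:                      # one graph feature vector per frame
            h1, c1 = self.lstm1.step(x, h1, c1)
            a = np.tanh(self.fc1 @ h1)        # fully connected layer
            h2, c2 = self.lstm2.step(a, h2, c2)
        return self.fc2 @ h2                  # class scores from the last frame
```

In practice the paper uses TensorFlow rather than hand-rolled cells; the sketch only conveys how per-frame graph features flow through the alternating recurrent and dense layers.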

Experimental results

The TGLSTM has been coded in Python with the aid of the following open source libraries:

  • TensorFlow, which enables the development of neural networks with GPU parallelism;

  • Scikit-learn, which includes a wide variety of machine learning functions;

  • NumPy, a Python scientific computing library;

  • OpenCV, which includes structures and functions for image processing.

Two different sets of experiments have been conducted. First, we demonstrate the robustness of TGLSTM on the action recognition problem. Then, the

Conclusions

We have introduced for the first time a Time based Graph neural network built with LSTM layers. The TGLSTM is composed of alternating recurrent Long Short-Term Memory (LSTM) layers and Fully Connected layers that capture information from both the graph nodes and their links.

We have assessed its performance on human activity recognition and gait analysis, specifically on four datasets (MSR Action 3D, CAD-60, CASIA Gait B, and "TUM Gait from Audio, Image and Depth" (TUM-GAID)) against some

References (52)

  • L. Sloman et al., "Gait patterns of depressed patients and normal subjects," Am. J. Psychiatry, 1982.
  • E.R. Caianiello et al., "Neural networks, fuzziness and image processing."
  • M.E. Maresca et al., "Clustering local motion estimates for robust and efficient object tracking."
  • J. Wang et al., "Mining actionlet ensemble for action recognition with depth cameras," 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012.
  • R. Vemulapalli et al., "Human action recognition by representing 3D skeletons as points in a Lie group," 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
  • F. Battistone et al., "GRUNTS: graph representation for UNsupervised temporal segmentation."
  • D. Wu et al., "Leveraging hierarchical parametric networks for skeletal joints based action segmentation and recognition," 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
  • A. Grushin et al., "Robust human action recognition via long short-term memory," The 2013 International Joint Conference on Neural Networks (IJCNN), 2013.
  • M. Baccouche et al., "Sequential deep learning for human action recognition," 2011.
  • G. Lefebvre et al., "BLSTM-RNN based 3D gesture classification," 2013.
  • Y. Du et al., "Hierarchical recurrent neural network for skeleton based action recognition," 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • P. Frasconi et al., "A general framework for adaptive processing of data structures," IEEE Trans. Neural Netw., 1998.
  • C.W. Omlin et al., "Stable encoding of large finite-state automata in recurrent neural networks with sigmoid discriminants," Neural Comput., 1996.
  • M. Gori et al., "Encoding nondeterministic fuzzy tree automata into recursive neural networks," Trans. Neural Netw., 2004.
  • T.N. Kipf et al., "Semi-supervised classification with graph convolutional networks," International Conference on Learning Representations (ICLR), 2017.
  • F. Scarselli et al., "The graph neural network model," IEEE Trans. Neural Netw., 2009.