Elsevier

Pattern Recognition Letters

Volume 135, July 2020, Pages 286-292

Topology-learnable graph convolution for skeleton-based action recognition

https://doi.org/10.1016/j.patrec.2020.05.005

Highlights

  • An effective method is proposed to generalize convolutions to the graph domain.

  • The latent graph topologies beyond the handcrafted topology can be learned adaptively.

  • A topology-learnable graph convolution network is constructed.

  • The self-learning process makes the graph convolution more flexible and universal.

Abstract

Graph convolutional networks (GCNs) generalize convolutional neural networks to irregular graph-like structures. Generally, graph topologies are set by hand and fixed over all layers. Handcrafted connections may not be optimal and cannot fully use the self-learning ability of deep learning. In this work, we explore a topology-learnable graph convolution for skeleton-based action recognition. Specifically, a spatial graph convolution can be decomposed into a feature learning component that evolves the features of each graph vertex, and a graph vertex fusion component in which the latent graph topologies can be learned adaptively. Different initialization strategies for the learnable fusion matrix are evaluated. Experimental results based on spatial-temporal GCNs for skeleton-based action recognition demonstrate that convolution can work on graphs as it does on images, even if only a specific fusion matrix initialization that uses adjacency matrices is applied. Moreover, the self-learning process can learn the latent topology of a graph beyond the handcrafted topology, thereby making graph convolution flexible and universal.

Introduction

Convolutional neural networks (CNNs) [4] have achieved great success on machine learning tasks involving one-dimensional speech [5], two-dimensional images [11] and three-dimensional videos [27], where the underlying data representation has a grid-like structure. Local convolution filters with learnable parameters can be reused efficiently on all input positions. However, many fields involve data that cannot be represented in a grid-like structure. Such data generally lie in an irregular domain and can usually be represented in graph-based forms. Irregular structures prevent CNNs from being generalized directly to the graph domain.

Nevertheless, generalizing convolutions to the graph domain is an emerging topic in deep learning research. Advances in constructing graph convolutional networks (GCNs) are generally categorized into 1) spectral and 2) spatial perspectives. The spatial perspective defines convolutions directly on the graph within k-step neighbors [1], [2]. However, maintaining the weight sharing property of CNNs on different-sized neighborhoods is challenging. Generally, a transition matrix is constructed to define the neighborhood for each node, and a specific weight matrix is required for each node degree [1], [2]. Niepert et al. [18] assembled and normalized a fixed-size neighborhood for each node in the selected fixed-length sequence and learned neighborhood representations with CNNs. In such cases, GCNs are applied in the transductive setting with fixed graphs.
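To make the notion of k-step neighbors concrete, the following minimal sketch marks which vertices lie within k hops on a toy 5-vertex chain graph and then normalizes the resulting matrix. The chain graph, the function names, and the symmetric normalization D^{-1/2} A D^{-1/2} are illustrative assumptions and are not taken from [1], [2].

```python
import numpy as np

def k_step_adjacency(adj, k):
    """Binary matrix marking vertex pairs at most k hops apart (self included)."""
    n = adj.shape[0]
    # (A + I)^k has a positive entry exactly where a walk of length <= k exists.
    walks = np.linalg.matrix_power(adj + np.eye(n), k)
    return (walks > 0).astype(float)

def normalize_symmetric(adj_with_self_loops):
    """Symmetric normalization D^{-1/2} A D^{-1/2}; assumes every degree is > 0."""
    d_inv_sqrt = 1.0 / np.sqrt(adj_with_self_loops.sum(axis=1))
    return adj_with_self_loops * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# Hypothetical toy graph: a chain 0 - 1 - 2 - 3 - 4.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
A = np.zeros((5, 5))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

A1 = k_step_adjacency(A, 1)      # direct neighbors plus self
A2 = k_step_adjacency(A, 2)      # neighbors within two hops plus self
A2_norm = normalize_symmetric(A2)
```

Because the neighborhood size differs from vertex to vertex (end vertices of the chain have fewer neighbors than interior ones), a single shared filter cannot be applied position by position as on an image grid, which is the weight-sharing difficulty described above.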

Handcrafted and normalized adjacency matrices are usually fixed over all layers when constructing GCNs. However, the handcrafted graph topology may not be optimal. Hamilton et al. [6] proposed an inductive framework that simultaneously learned the topological structure of each node’s neighborhood and the distribution of node features in the neighborhood. A set of aggregator functions is learned to aggregate the features from a node’s local neighborhood. Velickovic et al. [28] proposed an attention-based architecture for the node classification of graph-structured data. The proposed self-attention strategy learns the adaptive relationships between a node and its neighbors. This method makes the weight matrix of neighborhoods learnable. Shi et al. [23] overcame the constraint of local neighborhood by introducing the non-local idea into GCNs and constructed non-local GCNs based on spatial-temporal GCNs (ST-GCNs) [30] for skeleton-based action recognition.

Inspired by these works, we explore spatial graph convolutions based on the ST-GCN to fully use the self-learning ability of deep learning. Generally, spatial perspective graph convolutions can be implemented in two steps: (a) feature learning and (b) graph vertex fusion. The feature learning operations for each node can be implemented using a 1 × 1 2D-convolution [23], [30] or matrix multiplication [28]. Graph vertex fusion can be performed by matrix multiplication with a specifically initialized topology-learnable weight.
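The two steps can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the class name `TopologyLearnableGraphConv`, the single fusion matrix (no neighbor partitioning, normalization layers, or temporal convolution), the symmetric adjacency normalization, and the toy 5-joint chain are all hypothetical. Input features are laid out as (channels, frames, vertices).

```python
import numpy as np

def normalized_adjacency(edges, num_vertices):
    """Adjacency with self-loops, symmetrically normalized (an assumed choice)."""
    adj = np.eye(num_vertices)
    for i, j in edges:
        adj[i, j] = adj[j, i] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    return adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

class TopologyLearnableGraphConv:
    """Hypothetical single-matrix sketch of a two-step spatial graph convolution."""

    def __init__(self, in_channels, out_channels, edges, num_vertices, rng):
        # (a) Feature learning weight: a 1 x 1 convolution is a per-vertex
        #     linear map over channels.
        self.weight = 0.1 * rng.standard_normal((out_channels, in_channels))
        # (b) Fusion matrix: initialized from the handcrafted topology, but
        #     stored as a dense (V, V) parameter so training may change any entry.
        self.fusion = normalized_adjacency(edges, num_vertices)

    def __call__(self, x):
        # x has shape (in_channels, frames, vertices).
        features = np.einsum("oc,ctv->otv", self.weight, x)      # step (a)
        return np.einsum("otv,vw->otw", features, self.fusion)   # step (b)

rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]   # toy chain of 5 joints
layer = TopologyLearnableGraphConv(3, 8, edges, num_vertices=5, rng=rng)
x = rng.standard_normal((3, 4, 5))          # 3 channels, 4 frames, 5 joints
y = layer(x)                                # shape (8, 4, 5)
```

Since step (a) acts only on the channel axis and step (b) only on the vertex axis, the two steps commute in this sketch; applying the fusion first and the channel map second gives the same output.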

Unlike the non-local strategy [23], which breaks the constraint of the local neighborhood by introducing the non-local idea into GCNs, we believe that the proposed topology-learnable graph convolution possesses an inherent ability to break the local constraint. The non-local idea is redundant when the traditional k-step adjacency matrix is replaced with a topology-learnable one. In contrast with the self-attention strategy [28], which learns the adaptive relationships between a node and its neighbors, we extend the adaptive relationships to the entire graph.
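A small numerical sketch illustrates why a dense learnable fusion matrix is not confined to the local neighborhood. It takes one gradient step on a toy least-squares objective, which is an assumption made purely for illustration and not the training setup of this work; the chain graph, random features, and step size are likewise hypothetical. An update masked by the adjacency pattern, analogous to restricting learned relationships to a node's neighbors, is shown for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
num_vertices, channels = 5, 3

# Chain adjacency with self-loops, row-normalized, used as the initial fusion matrix.
adj = np.eye(num_vertices)
for i in range(num_vertices - 1):
    adj[i, i + 1] = adj[i + 1, i] = 1.0
fusion = adj / adj.sum(axis=1, keepdims=True)

# Toy per-vertex features and targets (random, for illustration only).
features = rng.standard_normal((channels, num_vertices))
target = rng.standard_normal((channels, num_vertices))

# Toy objective: loss = 0.5 * ||features @ fusion - target||_F^2.
residual = features @ fusion - target
grad_fusion = features.T @ residual          # d loss / d fusion, shape (V, V)

step = 0.1
updated_dense = fusion - step * grad_fusion                  # every entry may change
updated_masked = fusion - step * grad_fusion * (adj > 0)     # restricted to neighbors

outside = adj == 0   # vertex pairs not connected in the handcrafted topology
num_new_links = np.count_nonzero(updated_dense[outside])
```

In the dense update, entries between vertices that are not connected in the handcrafted graph become nonzero after a single step, whereas the masked update leaves them at zero.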

The remainder of this work is organized as follows. Section 2 reviews the related works. Section 3 provides a detailed description and analysis of the proposed topology-learnable graph convolution, followed by the experiments in Section 4. Finally, Section 5 presents the conclusion.

Section snippets

Graph convolutional network for action recognition

Skeletal action recognition has been widely explored using conventional methods with handcrafted features or neural network based methods. The human skeleton can be naturally represented by a graph, in which each vertex represents one human joint. Thus, human actions can be viewed as spatial temporal graphs. In recent years, researchers have generalized convolutions to the graph domain for skeleton-based action recognition.

Yan et al. [30] proposed ST-GCNs for skeleton-based action recognition.

Graph convolution

The skeleton data in one frame of human action sequences are always provided as a vector sequence. Each vector represents the 2D or 3D coordinates of one human joint. According to the kinematics model of the human body, the skeleton data can be naturally represented by a graph. In the ST-GCN [30], the skeleton graph is constructed using joints as vertices and bones as edges. The corresponding joints in adjacent frames are connected as time edges. The spatial-temporal skeleton graph can be

Experiments

Our research is based on ST-GCN [30] and 2s-AGCN [23]. Therefore, our experiments are conducted on the same two large-scale action recognition datasets: the Kinetics-Skeleton [8], [30] and NTU RGB+D [20] datasets.

Conclusion

This work proposes a simple but effective method of generalizing convolutions to the graph domain. A topology-learnable graph convolution is proposed to fully use the self-learning ability of deep learning. Specifically, a spatial graph convolution is decomposed into a feature learning component that evolves the features of each graph vertex, and a graph vertex fusion component in which the latent graph topologies can be learned adaptively. The key is to make the graph topology learnable by

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China under Grant No. 61702390, the Fundamental Research Funds for the Central Universities under Grant JB181001, and the Key Research and Development Program of Shaanxi Province under Grant No. 2018ZDXM-GY-036.

References (32)

  • M. Liu et al.

    Enhanced skeleton visualization for view invariant human action recognition

    Pattern Recognit.

    (2017)
  • J. Atwood et al.

    Diffusion-convolutional neural networks

    Neural Information Processing Systems (NeurIPS)

    (2016)
  • D.K. Duvenaud et al.

    Convolutional networks on graphs for learning molecular fingerprints

    Neural Information Processing Systems (NeurIPS)

    (2015)
  • B. Fernando et al.

    Modeling video evolution for action recognition

    IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    (2015)
  • I. Goodfellow et al.

    Deep Learning

    (2016)
  • A. Graves et al.

    Speech recognition with deep recurrent neural networks

    IEEE International Conference on Acoustics, Speech and Signal Processing

    (2013)
  • W.L. Hamilton et al.

    Inductive representation learning on large graphs

    Neural Information Processing Systems (NeurIPS)

    (2017)
  • L. Kaiser et al.

    Depthwise separable convolutions for neural machine translation

    International Conference on Learning Representations (ICLR)

    (2018)
  • W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M....
  • Q. Ke et al.

    A new representation of skeleton sequences for 3d action recognition

    IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    (2017)
  • T.S. Kim et al.

    Interpretable 3d human action analysis with temporal convolutional networks

    IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

    (2017)
  • A. Krizhevsky et al.

    Imagenet classification with deep convolutional neural networks

    Neural Information Processing Systems (NeurIPS)

    (2012)
  • B. Li et al.

    Skeleton based action recognition using translation-scale invariant image mapping and multi-scale deep cnn

    IEEE International Conference on Multimedia & Expo Workshops (ICMEW)

    (2017)
  • C. Li et al.

    Spatio-temporal graph convolution for skeleton based action recognition

    AAAI Conference on Artificial Intelligence (AAAI)

    (2018)
  • C. Li et al.

    Skeleton-based action recognition with convolutional neural networks

    IEEE International Conference on Multimedia & Expo Workshops (ICMEW)

    (2017)
  • M. Li et al.

    Actional-structural graph convolutional networks for skeleton-based action recognition

    IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    (2019)