Topology-learnable graph convolution for skeleton-based action recognition

doi:10.1016/j.patrec.2020.05.005

Pattern Recognition Letters

Volume 135, July 2020, Pages 286-292

https://doi.org/10.1016/j.patrec.2020.05.005

Highlights

• An effective method is proposed to generalize convolutions to the graph domain.
• The latent graph topologies beyond the handcrafted topology can be learned adaptively.
• A topology-learnable graph convolution network is constructed.
• The self-learning process makes the graph convolution more flexible and universal.

Abstract

Graph convolutional networks (GCNs) generalize convolutional neural networks to irregular graph-like structures. Generally, graph topologies are set by hand and fixed over all layers. Handcrafted connections may not be optimal and cannot fully use the self-learning ability of deep learning. In this work, we explore a topology-learnable graph convolution for skeleton-based action recognition. Specifically, a spatial graph convolution can be decomposed into a feature learning component that evolves the features of each graph vertex, and a graph vertex fusion component in which the latent graph topologies can be learned adaptively. Different initialization strategies for the learnable fusion matrix are evaluated. Experimental results based on spatial-temporal GCNs for skeleton-based action recognition demonstrate that convolution can work on graphs as it does on images, even when only a specific fusion matrix initialization that uses adjacency matrices is applied. Moreover, the self-learning process can learn the latent topology of a graph beyond the handcrafted topology, thereby making graph convolution flexible and universal.

Introduction

Convolutional neural networks (CNNs) [4] have achieved great success in machine learning on data such as one-dimensional speech [5], two-dimensional images [11] and three-dimensional videos [27], where the underlying data representation has a grid-like structure. Local convolution filters with learnable parameters can be reused efficiently at all input positions. However, many fields involve data that cannot be represented in a grid-like structure. Such data generally lie in an irregular domain and can usually be represented in graph-based forms. Irregular structures make it difficult to generalize CNNs to the graph domain.

Nevertheless, generalizing convolutions to the graph domain is an emerging topic in deep learning research. The advances in constructing graph convolutional networks (GCNs) on graphs are generally categorized into 1) spectral and 2) spatial perspectives. The spatial perspective defines convolutions directly on the graph within k-step neighbors [1], [2]. However, maintaining the weight sharing property of CNNs on different-sized neighborhoods is challenging. Generally, a transition matrix is constructed to define the neighborhood for each node, and a specific weight matrix is required for each node degree [1], [2]. Niepert et al. [18] assembled and normalized a fixed-size neighborhood for each node in the selected fixed-length sequence and learned neighborhood representations with CNNs. In such cases, GCNs are applied in the transductive setting with fixed graphs.
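To make the notion of a k-step neighborhood concrete, the following minimal Python sketch collects all vertices within k hops of a source vertex by breadth-first search. The function name `k_step_neighbors`, the edge-list input, and the five-vertex chain are illustrative assumptions and are not taken from the cited methods. The two calls at the end show that a 1-step neighborhood differs in size between an end vertex and an interior vertex, which is the situation that complicates weight sharing.

```python
from collections import deque


def k_step_neighbors(edges, source, k):
    """Return {vertex: hop distance} for all vertices within k hops of source."""
    # Build an undirected adjacency list from the edge list.
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if distance[node] == k:
            continue  # do not expand beyond k hops
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


# Hypothetical chain graph 0 - 1 - 2 - 3 - 4.
chain = [(0, 1), (1, 2), (2, 3), (3, 4)]
end_vertex = k_step_neighbors(chain, source=0, k=1)       # vertices 0 and 1
interior_vertex = k_step_neighbors(chain, source=2, k=1)  # vertices 1, 2 and 3
```

Because the neighborhoods contain two and three vertices respectively, a single fixed-size filter cannot be applied to both without some additional normalization or partitioning rule.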

Handcrafted and normalized adjacency matrices are usually fixed over all layers when constructing GCNs. However, the handcrafted graph topology may not be optimal. Hamilton et al. [6] proposed an inductive framework that simultaneously learned the topological structure of each node’s neighborhood and the distribution of node features in the neighborhood. A set of aggregator functions is learned to aggregate the features from a node’s local neighborhood. Velickovic et al. [28] proposed an attention-based architecture for the node classification of graph-structured data. The proposed self-attention strategy learns the adaptive relationships between a node and its neighbors. This method makes the weight matrix of neighborhoods learnable. Shi et al. [23] overcame the constraint of local neighborhood by introducing the non-local idea into GCNs and constructed non-local GCNs based on spatial-temporal GCNs (ST-GCNs) [30] for skeleton-based action recognition.

Inspired by these works, we explore spatial graph convolutions based on the ST-GCN to fully use the self-learning ability of deep learning. Generally, spatial perspective graph convolutions can be implemented in two steps: (a) feature learning and (b) graph vertex fusion. The feature learning operations for each node can be implemented using a 1 × 1 2D-convolution [23], [30] or matrix multiplication [28]. Graph vertex fusion can be performed by matrix multiplication with a specifically initialized topology-learnable weight.
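The two steps can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the implementation used in the paper: it handles a single frame and a single fusion matrix, stores features as a V x C_in list of rows, and uses hypothetical names (`matmul`, `graph_conv`) and toy values. The shared weight matrix plays the role of the 1 × 1 convolution, and the fusion matrix is initialized here from the adjacency matrix of a three-vertex chain with self-loops.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def graph_conv(features, weight, fusion):
    """Two-step spatial graph convolution on a single frame.

    features: V x C_in, one row per vertex
    weight:   C_in x C_out, shared by all vertices (acts like a 1 x 1 convolution)
    fusion:   V x V, row i holds the mixing coefficients for output vertex i
    """
    learned = matmul(features, weight)  # (a) feature learning, per vertex
    return matmul(fusion, learned)      # (b) graph vertex fusion, across vertices


# Toy chain graph 0 - 1 - 2 with self-loops, used to initialize the fusion matrix.
adjacency = [[1.0, 1.0, 0.0],
             [1.0, 1.0, 1.0],
             [0.0, 1.0, 1.0]]
features = [[1.0, 0.0],
            [0.0, 1.0],
            [2.0, 2.0]]
weight = [[1.0], [0.5]]  # maps 2 input channels to 1 output channel

output = graph_conv(features, weight, adjacency)
print(output)  # [[1.5], [4.5], [3.5]]
```

In a trainable setting, both `weight` and `fusion` would be parameters updated by backpropagation; keeping `fusion` fixed at the adjacency matrix recovers a convolution with a handcrafted topology.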

Unlike the non-local strategy [23], which breaks the constraint of local neighborhood by introducing the non-local idea into GCNs, we believe that the proposed topology-learnable graph convolution possesses an inherent ability to break the local constraint. The non-local idea is redundant when the traditional k-step adjacency matrix is replaced by a topology-learnable one. In contrast with the self-attention strategy [28], which learns the adaptive relationships between a node and its neighbors, we extend the adaptive relationships to the entire graph.
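A small numerical sketch illustrates why a freely learnable fusion matrix is not confined to the local neighborhood. The setup is hypothetical: a squared-error loss, a made-up target, a fixed step size, and the names `fuse` and `sgd_step_on_fusion` are all assumptions for illustration. The fusion matrix starts as the adjacency matrix of a three-vertex chain, so the entry linking vertices 0 and 2 is zero, yet one unmasked gradient step makes that entry nonzero.

```python
def fuse(fusion, hidden):
    """Graph vertex fusion: multiply the V x V fusion matrix by V x C features."""
    return [[sum(fusion[i][k] * hidden[k][c] for k in range(len(hidden)))
             for c in range(len(hidden[0]))]
            for i in range(len(fusion))]


def sgd_step_on_fusion(fusion, hidden, target, lr):
    """One gradient step on the fusion matrix for L = 0.5 * ||fusion @ hidden - target||^2.

    The gradient is (output - target) @ hidden^T, which is dense in general,
    so entries that start at zero are updated as well.
    """
    output = fuse(fusion, hidden)
    n_vertices, n_channels = len(fusion), len(hidden[0])
    updated = []
    for i in range(n_vertices):
        row = []
        for j in range(n_vertices):
            grad_ij = sum((output[i][c] - target[i][c]) * hidden[j][c]
                          for c in range(n_channels))
            row.append(fusion[i][j] - lr * grad_ij)
        updated.append(row)
    return updated


# Fusion matrix initialized from a chain 0 - 1 - 2 with self-loops (hypothetical).
adjacency = [[1.0, 1.0, 0.0],
             [1.0, 1.0, 1.0],
             [0.0, 1.0, 1.0]]
hidden = [[1.0], [2.0], [4.0]]  # per-vertex features after feature learning
target = [[5.0], [7.0], [6.0]]  # made-up target; only vertex 0 has an error

updated = sgd_step_on_fusion(adjacency, hidden, target, lr=0.03125)
print(updated[0])  # [1.0625, 1.125, 0.25]
```

Vertices 0 and 2 are two hops apart in the initial graph, but the updated matrix connects them directly with weight 0.25, while the rows with zero error are left unchanged.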

The remainder of this work is organized as follows. Section 2 reviews the related works. Section 3 provides a detailed description and analysis of the proposed topology-learnable graph convolution, followed by the experiments in Section 4. Finally, Section 5 presents the conclusion.

Section snippets

Graph convolutional network for action recognition

Skeletal action recognition has been widely explored using conventional methods with handcrafted features or neural network based methods. The human skeleton can be naturally represented by a graph, in which each vertex represents one human joint. Thus, human actions can be viewed as spatial temporal graphs. In recent years, researchers have generalized convolutions to the graph domain for skeleton-based action recognition.

Yan et al. [30] proposed ST-GCNs for skeleton-based action recognition.

Graph convolution

The skeleton data in one frame of human action sequences are always provided as a vector sequence. Each vector represents the 2D or 3D coordinates of one human joint. According to the kinematics model of the human body, the skeleton data can be naturally represented by a graph. In the ST-GCN [30], the skeleton graph is constructed using joints as vertices and bones as edges. The corresponding joints in adjacent frames are connected as time edges. The spatial-temporal skeleton graph can be
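As a rough illustration of the construction described above (joints as vertices, bones as spatial edges, and the same joint in adjacent frames linked by time edges), the following Python sketch builds the edge lists for a toy skeleton. The function name `build_skeleton_graph`, the vertex indexing scheme `frame * num_joints + joint`, and the three-joint arm are assumptions made for illustration; real datasets define their own joint layouts.

```python
def build_skeleton_graph(num_joints, bones, num_frames):
    """Return (spatial_edges, temporal_edges) for a skeleton sequence.

    Vertices are indexed as frame * num_joints + joint.
    bones is a list of (joint_a, joint_b) pairs within one frame.
    """
    spatial_edges = []
    temporal_edges = []
    for t in range(num_frames):
        offset = t * num_joints
        # Bones connect joints inside the same frame.
        for a, b in bones:
            spatial_edges.append((offset + a, offset + b))
        # Time edges connect each joint to itself in the next frame.
        if t + 1 < num_frames:
            for j in range(num_joints):
                temporal_edges.append((offset + j, offset + num_joints + j))
    return spatial_edges, temporal_edges


# Hypothetical 3-joint arm: shoulder (0) - elbow (1) - wrist (2), over 2 frames.
spatial, temporal = build_skeleton_graph(3, [(0, 1), (1, 2)], 2)
print(spatial)   # [(0, 1), (1, 2), (3, 4), (4, 5)]
print(temporal)  # [(0, 3), (1, 4), (2, 5)]
```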

Experiments

Our research is based on ST-GCN [30] and 2s-AGCN [23]. Therefore, our experiments are conducted on the same two large-scale action recognition datasets: the Kinetics-Skeleton [8], [30] and NTU RGB+D [20] datasets.

Conclusion

This work proposes a simple but effective method of generalizing convolutions to the graph domain. A topology-learnable graph convolution is proposed to fully use the self-learning ability of deep learning. Specifically, a spatial graph convolution is decomposed into a feature learning component that evolves the features of each graph vertex, and a graph vertex fusion component in which the latent graph topologies can be learned adaptively. The key is to let the graph topology learnable by

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China under Grant No. 61702390, the Fundamental Research Funds for the Central Universities under Grant JB181001, and the Key Research and Development Program of Shaanxi Province under Grant No. 2018ZDXM-GY-036.

References (32)

M. Liu et al.
Enhanced skeleton visualization for view invariant human action recognition
Pattern Recognit.
(2017)
J. Atwood et al.
Diffusion-convolutional neural networks
Neural Information Processing Systems (NeurIPS)
(2016)
D.K. Duvenaud et al.
Convolutional networks on graphs for learning molecular fingerprints
Neural Information Processing Systems (NeurIPS)
(2015)
B. Fernando et al.
Modeling video evolution for action recognition
IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
(2015)
I. Goodfellow et al.
Deep Learning
(2016)
A. Graves et al.
Speech recognition with deep recurrent neural networks
IEEE International Conference on Acoustics, Speech and Signal Processing
(2013)
W.L. Hamilton et al.
Inductive representation learning on large graphs
Neural Information Processing Systems (NeurIPS)
(2017)
L. Kaiser et al.
Depthwise separable convolutions for neural machine translation
International Conference on Learning Representations (ICLR)
(2018)
W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M....
Q. Ke et al.
A new representation of skeleton sequences for 3d action recognition
IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
(2017)
T.S. Kim et al.
Interpretable 3d human action analysis with temporal convolutional networks
IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
(2017)
A. Krizhevsky et al.
Imagenet classification with deep convolutional neural networks
Neural Information Processing Systems (NeurIPS)
(2012)
B. Li et al.
Skeleton based action recognition using translation-scale invariant image mapping and multi-scale deep cnn
IEEE International Conference on Multimedia & Expo Workshops (ICMEW)
(2017)
C. Li et al.
Spatio-temporal graph convolution for skeleton based action recognition
AAAI Conference on Artificial Intelligence (AAAI)
(2018)
C. Li et al.
Skeleton-based action recognition with convolutional neural networks
IEEE International Conference on Multimedia & Expo Workshops (ICMEW)
(2017)
M. Li et al.
Actional-structural graph convolutional networks for skeleton-based action recognition
IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
(2019)

