Topology-learnable graph convolution for skeleton-based action recognition
Introduction
Convolutional neural networks (CNNs) [4] have achieved great success in machine learning on data whose underlying representation has a grid-like structure, such as one-dimensional speech [5], two-dimensional images [11], and three-dimensional videos [27]. Local convolution filters with learnable parameters can be reused efficiently at all input positions. However, many fields involve data that cannot be represented in a grid-like structure. Such data generally lie in an irregular domain and can usually be represented as graphs. These irregular structures prevent CNNs from being applied directly to the graph domain.
Nevertheless, generalizing convolutions to the graph domain is an emerging topic in deep learning research. Advances in constructing graph convolutional networks (GCNs) are generally categorized into 1) spectral and 2) spatial perspectives. The spatial perspective defines convolutions directly on the graph within the k-step neighbors of each node [1], [2]. However, maintaining the weight-sharing property of CNNs across neighborhoods of different sizes is challenging. Generally, a transition matrix is constructed to define the neighborhood of each node, and a specific weight matrix is required for each node degree [1], [2]. Niepert et al. [18] assembled and normalized a fixed-size neighborhood for each node in a selected fixed-length sequence of nodes and learned neighborhood representations with CNNs. In such cases, GCNs are applied in the transductive setting with fixed graphs.
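As a minimal sketch of the k-step neighborhood idea (an illustration only, not code from the cited works), the neighborhood of every node can be derived from the adjacency matrix by repeated one-step expansion. The function name `k_step_neighborhood` and the toy path graph are assumptions made for this example. The differing row sums show why sharing one set of weights across neighborhoods is not straightforward.

```python
import numpy as np

def k_step_neighborhood(adj, k):
    """Return a boolean (V, V) mask; entry (i, j) is True when node j
    can be reached from node i in at most k steps (self included)."""
    n = adj.shape[0]
    # One step of reachability: existing edges plus self-loops.
    one_step = ((adj != 0) | np.eye(n, dtype=bool)).astype(int)
    reach = np.eye(n, dtype=int)  # zero steps: each node reaches only itself
    for _ in range(k):
        reach = ((reach @ one_step) > 0).astype(int)
    return reach.astype(bool)

# Toy path graph 0 - 1 - 2 - 3 (hypothetical, not from the paper).
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])

mask1 = k_step_neighborhood(adj, 1)
mask2 = k_step_neighborhood(adj, 2)

# Neighborhood sizes differ from node to node: [2, 3, 3, 2] for k = 1.
sizes1 = mask1.sum(axis=1).tolist()
```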
Handcrafted, normalized adjacency matrices are usually kept fixed across all layers when constructing GCNs. However, a handcrafted graph topology may not be optimal. Hamilton et al. [6] proposed an inductive framework that simultaneously learned the topological structure of each node’s neighborhood and the distribution of node features in the neighborhood. A set of aggregator functions is learned to aggregate the features from a node’s local neighborhood. Velickovic et al. [28] proposed an attention-based architecture for the node classification of graph-structured data. The proposed self-attention strategy learns the adaptive relationships between a node and its neighbors. This method makes the weight matrix of neighborhoods learnable. Shi et al. [23] overcame the constraint of the local neighborhood by introducing the non-local idea into GCNs, constructing non-local GCNs based on spatial-temporal GCNs (ST-GCNs) [30] for skeleton-based action recognition.
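To make the notion of a handcrafted, normalized adjacency matrix concrete, the sketch below applies one common choice, symmetric normalization with self-loops. The text above does not state which normalization is meant, so this particular formula, the function name `normalize_adjacency`, and the toy graph are assumptions for illustration.

```python
import numpy as np

def normalize_adjacency(adj):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2}.
    This specific form is an assumed, commonly used choice."""
    a_hat = adj + np.eye(adj.shape[0])     # add self-loops
    deg = a_hat.sum(axis=1)                # degree of each node, at least 1
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# Toy 3-node chain 0 - 1 - 2 (hypothetical example).
adj = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])

a_norm = normalize_adjacency(adj)
```

Once computed, such a matrix is a constant: entries for non-adjacent pairs, such as nodes 0 and 2 here, remain zero in every layer.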
Inspired by these works, we explore spatial graph convolutions based on the ST-GCN to fully use the self-learning ability of deep learning. Generally, spatial perspective graph convolutions can be implemented in two steps: (a) feature learning and (b) graph vertex fusion. The feature learning operations for each node can be implemented using a 1 × 1 2D-convolution [23], [30] or matrix multiplication [28]. Graph vertex fusion can be performed by matrix multiplication with a specifically initialized topology-learnable weight.
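The two steps can be sketched for a single frame as two matrix multiplications. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `graph_conv`, the tensor shapes, and the initialization of the topology matrix from a row-normalized chain adjacency are all hypothetical. In a real network both `weight` and `topology` would be trainable parameters.

```python
import numpy as np

def graph_conv(x, weight, topology):
    """Two-step spatial graph convolution for one frame.

    x:        (V, C_in)     input features, one row per vertex
    weight:   (C_in, C_out) feature-learning weights shared by all vertices
    topology: (V, V)        vertex-fusion matrix (trainable in practice)
    """
    features = x @ weight        # (a) feature learning, same role as a 1 x 1 convolution
    return topology @ features   # (b) graph vertex fusion

rng = np.random.default_rng(0)
num_vertices, c_in, c_out = 5, 3, 4
x = rng.normal(size=(num_vertices, c_in))
weight = rng.normal(size=(c_in, c_out))

# Assumed initialization: row-normalized adjacency of a 5-vertex chain with self-loops.
adj = np.eye(num_vertices) + np.eye(num_vertices, k=1) + np.eye(num_vertices, k=-1)
topology = adj / adj.sum(axis=1, keepdims=True)

out = graph_conv(x, weight, topology)   # shape (V, C_out)
```

With an identity topology the second step does nothing and the layer reduces to per-vertex feature learning, which makes the separation of the two roles easy to check.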
Unlike the non-local strategy [23], which breaks the constraint of the local neighborhood by introducing the non-local idea into GCNs, we believe that the proposed topology-learnable graph convolution possesses an inherent ability to break the local constraint. The non-local idea becomes redundant when the traditional k-step adjacency matrix is replaced with a topology-learnable one. In contrast with the self-attention strategy [28], which learns the adaptive relationships between a node and its neighbors, we extend the adaptive relationships to the entire graph.
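A toy gradient step illustrates why a fully learnable topology matrix is not confined to the local neighborhood. The setup is an assumption made for illustration (a squared-error loss, random features and targets, a 4-vertex chain, and a hand-derived gradient) and is not the training procedure of the paper. When every entry of the matrix is free, an entry linking two non-adjacent vertices can move away from zero; when updates are masked by the fixed adjacency, it cannot.

```python
import numpy as np

rng = np.random.default_rng(0)
num_vertices, channels = 4, 3

# Chain graph 0 - 1 - 2 - 3 with self-loops; vertices 0 and 3 are not adjacent.
adj = np.eye(num_vertices) + np.eye(num_vertices, k=1) + np.eye(num_vertices, k=-1)

features = rng.normal(size=(num_vertices, channels))   # stand-in for learned features F
target = rng.normal(size=(num_vertices, channels))     # stand-in for a training signal Y

topology = adj.copy()                      # start from the local adjacency (assumed init)
residual = topology @ features - target
grad = residual @ features.T               # gradient of 0.5 * ||T F - Y||^2 w.r.t. T

lr = 0.1
topology_learnable = topology - lr * grad            # every entry is free to change
topology_masked = topology - lr * grad * (adj > 0)   # only existing edges are updated
```

After this single step, `topology_masked[0, 3]` is still zero, whereas `topology_learnable[0, 3]` has received an update, so vertex 0 can now draw on the features of vertex 3.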
The remainder of this work is organized as follows. Section 2 reviews the related works. Section 3 provides a detailed description and analysis of the proposed topology-learnable graph convolution, followed by the experiments in Section 4. Finally, Section 5 presents the conclusion.
Section snippets
Graph convolutional network for action recognition
Skeletal action recognition has been widely explored using conventional methods with handcrafted features or neural network based methods. The human skeleton can be naturally represented by a graph, in which each vertex represents one human joint. Thus, human actions can be viewed as spatial temporal graphs. In recent years, researchers have generalized convolutions to the graph domain for skeleton-based action recognition.
Yan et al. [30] proposed ST-GCNs for skeleton-based action recognition.
Graph convolution
The skeleton data in one frame of human action sequences are always provided as a vector sequence. Each vector represents the 2D or 3D coordinates of one human joint. According to the kinematics model of the human body, the skeleton data can be naturally represented by a graph. In the ST-GCN [30], the skeleton graph is constructed using joints as vertices and bones as edges. The corresponding joints in adjacent frames are connected as time edges. The spatial-temporal skeleton graph can be
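The construction described above, with joints as vertices, bones as spatial edges, and the same joint in adjacent frames linked by time edges, can be sketched as follows. The function name `build_spatial_temporal_edges`, the node-indexing scheme, and the 3-joint toy skeleton are assumptions for illustration and do not reproduce any dataset's joint layout.

```python
def build_spatial_temporal_edges(num_joints, bones, num_frames):
    """Return (spatial_edges, temporal_edges) as lists of node-index pairs.

    A node is a (frame, joint) pair flattened to frame * num_joints + joint.
    """
    def node(frame, joint):
        return frame * num_joints + joint

    # Spatial edges: every bone, repeated in every frame.
    spatial = [(node(t, a), node(t, b))
               for t in range(num_frames)
               for a, b in bones]
    # Temporal edges: the same joint in consecutive frames.
    temporal = [(node(t, j), node(t + 1, j))
                for t in range(num_frames - 1)
                for j in range(num_joints)]
    return spatial, temporal

# Hypothetical 3-joint chain (two bones) observed over 2 frames.
bones = [(0, 1), (1, 2)]
spatial_edges, temporal_edges = build_spatial_temporal_edges(3, bones, 2)
```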
Experiments
Our research is based on ST-GCN [30] and 2s-AGCN [23]. Therefore, our experiments are conducted on the same two large-scale action recognition datasets: the Kinetics-Skeleton [8], [30] and NTU RGB+D [20] datasets.
Conclusion
This work proposes a simple but effective method of generalizing convolutions to the graph domain. A topology-learnable graph convolution is proposed to fully use the self-learning ability of deep learning. Specifically, a spatial graph convolution is decomposed into a feature learning component that evolves the features of each graph vertex, and a graph vertex fusion component in which the latent graph topologies can be learned adaptively. The key is to make the graph topology learnable by
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is partially supported by the National Natural Science Foundation of China under Grant No. 61702390, the Fundamental Research Funds for the Central Universities under Grant JB181001, and the Key Research and Development Program of Shaanxi Province under Grant No. 2018ZDXM-GY-036.
References (32)
- et al., Enhanced skeleton visualization for view invariant human action recognition, Pattern Recognit. (2017)
- et al., Diffusion-convolutional neural networks, Neural Information Processing Systems (NeurIPS) (2016)
- et al., Convolutional networks on graphs for learning molecular fingerprints, Neural Information Processing Systems (NeurIPS) (2015)
- et al., Modeling video evolution for action recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
- et al., Deep Learning (2016)
- et al., Speech recognition with deep recurrent neural networks, IEEE International Conference on Acoustics, Speech and Signal Processing (2013)
- et al., Inductive representation learning on large graphs, Neural Information Processing Systems (NeurIPS) (2017)
- et al., Depthwise separable convolutions for neural machine translation, International Conference on Learning Representations (ICLR) (2018)
- W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M....
- et al., A new representation of skeleton sequences for 3D action recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
- Interpretable 3D human action analysis with temporal convolutional networks, IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
- ImageNet classification with deep convolutional neural networks, Neural Information Processing Systems (NeurIPS)
- Skeleton based action recognition using translation-scale invariant image mapping and multi-scale deep CNN, IEEE International Conference on Multimedia & Expo Workshops (ICMEW)
- Spatio-temporal graph convolution for skeleton based action recognition, AAAI Conference on Artificial Intelligence (AAAI)
- Skeleton-based action recognition with convolutional neural networks, IEEE International Conference on Multimedia & Expo Workshops (ICMEW)
- Actional-structural graph convolutional networks for skeleton-based action recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR)