
Knowledge-Based Systems

Volume 122, 15 April 2017, Pages 64-74

The spatial Laplacian and temporal energy pyramid representation for human action recognition using depth sequences

https://doi.org/10.1016/j.knosys.2017.01.035

Abstract

Depth sequences are useful for action recognition because they are insensitive to illumination variation and provide geometric information. Many current action recognition methods are computationally expensive and require large-scale training data. Here we propose an effective method for human action recognition using depth sequences captured by depth cameras. A multi-resolution operation, the spatial Laplacian and temporal energy pyramid (SLTEP), decomposes the depth sequences into frequency bands localized in space and time. A spatial aggregation and fusion scheme clusters the low-level features and concatenates two different feature types extracted from the low- and high-frequency levels, respectively. We evaluate our approach on five public benchmark datasets (MSRAction3D, MSRGesture3D, MSRActionPairs, MSRDailyActivity3D, and NTU RGB+D) and demonstrate its advantages over existing methods, as well as its suitability for online applications.

Introduction

Human action recognition has been of longstanding interest to the computer vision community due to its widespread potential in real-world applications, for example human–computer interaction [1] and intelligent video surveillance [2]. Conventional human action recognition studies mainly focus on recognizing actions from videos captured by color cameras, and these approaches still struggle with object segmentation and with illumination and texture variation. The emergence of inexpensive depth sensors such as Kinect [3], which capture depth information in real time, has captured the imagination of many researchers, including those studying action recognition. Compared to conventional color cameras, depth maps make subtracting the foreground from cluttered backgrounds much easier; furthermore, depth sensors are insensitive to illumination variation, facilitating applications in dark environments. Depth sensors also provide rich geometric information, which facilitates spatial information extraction and target recognition.

An action or gesture recognition system should work independently of the actor's identity, the speed of the performance, and the inherent variance among realizations of action instances [4], whilst ensuring fast classification, especially for online action recognition. Various detection and representation methods have been proposed to acquire accurate similarity measurements and improve action recognition performance using depth sequences, e.g., cloud points [5], skeleton joints [6], [7], and hyper-surface normals [8]. Skeleton joints can be accessed in real time by depth sensors and are computationally efficient for human action recognition, but less so for other applications such as gesture recognition and human–object interactions. Cloud points and hyper-surface normals use raw depth maps as inputs and are more robust to noise and occlusion [9], but are resource intensive. In this paper, we focus on improving action recognition performance using depth maps at low computational cost.

Our approach extracts features at different frequencies, obtained by decomposing the depth sequences with the spatial Laplacian and temporal energy pyramid (SLTEP). For the low-frequency component, 4D hyper-surface normals are extracted to capture spatial orientation cues. To retain the correlation between neighboring normals and make them more robust to noise, a local spatial neighborhood is introduced to cluster these low-level normal features into polynormal vectors. The coefficient-weighted differences between polynormals and visual words learnt by sparse coding are then computed to capture additional distribution information about the low-level features, and these weighted differences are aggregated by average pooling over each spatial grid. For the high-frequency component, we apply maximum pooling over silhouettes within the segments of the temporal energy pyramid to achieve invariance to the temporal location of motion. Then, histogram of oriented gradient (HOG) features (introduced in [10]) are extracted from the spatial grids to record body-shape information. Finally, the two feature types are fused to form the representation of the depth sequence.
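
For intuition, viewing the depth sequence as a hyper-surface z = f(x, y, t), a 4D normal is proportional to (-∂z/∂x, -∂z/∂y, -∂z/∂t, 1), as in HON4D [8]. The following is a minimal sketch using finite differences; the function name and the plain `np.gradient` derivatives are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def hyper_surface_normals(depth_seq):
    """4D surface normals for a depth sequence z = f(x, y, t).

    depth_seq: (T, H, W) array of depth frames.
    Returns a (T, H, W, 4) array of unit normals
    proportional to (-dz/dx, -dz/dy, -dz/dt, 1).
    """
    z = depth_seq.astype(np.float32)
    # Finite-difference derivatives along the t, y, and x axes.
    dz_dt, dz_dy, dz_dx = np.gradient(z)
    ones = np.ones_like(z)
    n = np.stack([-dz_dx, -dz_dy, -dz_dt, ones], axis=-1)
    # The constant last component keeps the norm strictly positive.
    return n / np.linalg.norm(n, axis=-1, keepdims=True)
```

Neighboring normals within a local spatial neighborhood would then be grouped into polynormal vectors before encoding, as described above.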

The main contributions of this paper can be summarized as follows: first, by employing spatial Laplacian and temporal energy pyramids, we decompose depth sequences to extract complementary features at different frequencies; second, local sparse representation is employed to obtain extra spatial distribution information from low-level features in local neighborhoods; and third, we propose a spatial aggregation and fusion scheme to cluster the low-level features and concatenate the extracted complementary features into the final representation. The experimental results demonstrate the advantages of the proposed method with respect to recognition accuracy and computational efficiency.
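
To make the second and third contributions concrete, the sketch below shows one plausible reading: each polynormal is sparse-coded against a learned dictionary of visual words, coefficient-weighted residuals are formed (a VLAD-like encoding), and the encodings are average-pooled per spatial grid cell before concatenation. The dictionary, grid assignment, and `alpha` are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np
from sklearn.decomposition import sparse_encode

def encode_polynormals(polynormals, dictionary, alpha=0.1):
    """Coefficient-weighted differences between polynormals and visual words.

    polynormals: (n, d) array, one flattened polynormal per row.
    dictionary:  (k, d) array of visual words learned by sparse coding.
    Returns an (n, k*d) array: for each polynormal, its residuals to all
    k visual words, each weighted by the corresponding sparse coefficient.
    """
    codes = sparse_encode(polynormals, dictionary, alpha=alpha)   # (n, k)
    diffs = polynormals[:, None, :] - dictionary[None, :, :]      # (n, k, d)
    weighted = codes[:, :, None] * diffs                          # (n, k, d)
    return weighted.reshape(len(polynormals), -1)

def pool_over_grids(encodings, grid_ids, n_grids):
    """Average-pool encodings within each spatial grid cell, then concatenate."""
    pooled = np.zeros((n_grids, encodings.shape[1]), dtype=encodings.dtype)
    for g in range(n_grids):
        members = encodings[grid_ids == g]
        if len(members):
            pooled[g] = members.mean(axis=0)
    return pooled.ravel()
```

The dictionary itself could come from standard sparse dictionary learning (e.g., `sklearn.decomposition.DictionaryLearning`) fit on training polynormals.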

The remainder of the paper is organized as follows. In Section 2, we review previous work related to action recognition. Section 3 describes our feature extraction and representation framework. In Section 4, we present experiments performed on five public benchmark datasets, the results of which demonstrate the effectiveness of our proposed approach compared to other published results. Finally, we conclude and discuss possibilities for future work in Section 5.

Section snippets

Related work

Space-time-based approaches are widely used in human action recognition from traditional color image sequences. These approaches rely on the detection and representation of space-time volumes. Laptev et al. [11] extracted features at multiple spatio-temporal scales to learn realistic human actions from movies. Based on the Laplacian pyramid, Shao et al. [12] decomposed videos into a series of sub-band feature 3D volumes and presented a novel descriptor, called spatio-temporal Laplacian pyramid
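
As background, a Laplacian pyramid stores the band-pass difference between each level of a Gaussian pyramid and the upsampled version of the next coarser level, plus the final low-pass residual. A minimal per-frame sketch with OpenCV follows; it illustrates the standard construction, not the code of [12] or of this paper.

```python
import cv2
import numpy as np

def laplacian_pyramid(frame, levels=3):
    """Decompose one depth frame into `levels` band-pass images
    plus the residual low-pass level."""
    current = frame.astype(np.float32)
    pyramid = []
    for _ in range(levels):
        down = cv2.pyrDown(current)                          # blur + downsample
        up = cv2.pyrUp(down, dstsize=current.shape[1::-1])   # upsample back
        pyramid.append(current - up)                         # band-pass difference
        current = down
    pyramid.append(current)                                  # low-pass residual
    return pyramid
```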

Overview of the framework

A schematic of the proposed framework is shown in Fig. 1. Depth sequence processing is divided into two parts by spatial Laplacian and temporal energy pyramids. Low-level features, i.e., 4D normals and silhouettes, are respectively extracted from two different frequency components and then compressed by temporal pooling to produce compact representations. Local sparse representation is employed to obtain extra spatial distribution information from local neighborhoods. Then, we propose a spatial
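
One plausible reading of the temporal branch is sketched below: per-frame motion energy is accumulated, each pyramid level splits the sequence into segments carrying roughly equal shares of that energy, and the frames inside a segment are max-pooled. The energy measure and the level structure (1, 2, 4 segments) are assumptions for illustration, not the authors' exact formulation.

```python
import numpy as np

def motion_energy(depth_seq):
    """Per-frame motion energy: summed absolute difference between
    consecutive depth frames (the first frame gets zero energy)."""
    diffs = np.abs(np.diff(depth_seq.astype(np.float32), axis=0))
    return np.concatenate([[0.0], diffs.sum(axis=(1, 2))])

def energy_pyramid_pooling(depth_seq, levels=(1, 2, 4)):
    """Split the sequence so each segment holds an equal share of the
    cumulative motion energy, then max-pool the frames in each segment."""
    energy = np.cumsum(motion_energy(depth_seq))
    energy /= max(energy[-1], 1e-8)              # normalize to [0, 1]
    pooled = []
    for n_seg in levels:
        bounds = np.searchsorted(energy, np.linspace(0, 1, n_seg + 1))
        bounds[0], bounds[-1] = 0, len(depth_seq)
        for s, e in zip(bounds[:-1], bounds[1:]):
            seg = depth_seq[s:max(e, s + 1)]     # guard against empty segments
            pooled.append(seg.max(axis=0))       # max-pooled 2D map
    return pooled
```

HOG descriptors could then be computed over spatial grids of each pooled map, e.g. with `skimage.feature.hog`.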

Experiments and discussion

To evaluate the performance of our approach, we conduct experiments on five public benchmark datasets: MSRAction3D [5], MSRGesture3D [41], MSRActionPairs [8], MSRDailyActivity3D [6], and the recent NTU RGB+D dataset [29]. Sample frames from these datasets are shown in Fig. 5. We compare our algorithm with several state-of-the-art approaches for human action recognition from depth sequences. For fair comparison, the multimodal methods, e.g., color videos with depth sequences and depth sequences

Conclusions and future work

In this paper, we present an efficient approach for human action recognition using depth sequences. The depth sequences are decomposed into certain frequency bands by spatial Laplacian and temporal energy pyramids. 4D hyper-surface normals and HOG features are extracted from the low-frequency (blurred) and high-frequency (difference) levels, respectively. To suppress outliers and obtain a compact representation of a sequence, maximal pooling is employed over the temporal segments. In addition, local sparse representations are

Acknowledgments

The research was supported by the National Natural Science Foundation of China (Nos. 6140021567 and 6140051238), the Natural Science Foundation of Guangdong Province (No. 2015A030313744), the Special Program of Guangdong Frontier and Key Technological Innovation (2016B010108010), the Guangdong Technology Project (2016B010108010), the Key Laboratory of Human-Machine Intelligence-Synergy Systems, Chinese Academy of Sciences (2014DP173025), and the Shenzhen Technology Project (JSGG20160331185256983).

References (41)

  • Z. Liu et al., 3D-based deep convolutional neural network for action recognition with depth sequences, Image Vision Comput. (2016)
  • D. Cireşan et al., Multi-column deep neural network for traffic sign classification, Neural Netw. (2012)
  • J. Yang et al., A structure optimization framework for feed-forward neural networks using sparse representation, Knowl.-Based Syst. (2016)
  • F. Moayedi et al., Structured sparse representation for human action recognition, Neurocomputing (2015)
  • F. Cao et al., Pose and illumination variable face recognition via sparse representation and illumination dictionary, Knowl.-Based Syst. (2016)
  • S. Chen et al., Discriminative local collaborative representation for online object tracking, Knowl.-Based Syst. (2016)
  • J. Shotton et al., Real-time human pose recognition in parts from single depth images
  • W. Li et al., Action recognition based on a bag of 3D points
  • J. Wang et al., Mining actionlet ensemble for action recognition with depth cameras
  • J. Wang et al., Learning actionlet ensemble for 3D human action recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2014)