
Knowledge-Based Systems

Volume 122, 15 April 2017, Pages 64-74

The spatial Laplacian and temporal energy pyramid representation for human action recognition using depth sequences

https://doi.org/10.1016/j.knosys.2017.01.035

Abstract

Depth sequences are useful for action recognition because they are insensitive to illumination variation and provide geometric information. Many current action recognition methods are computationally expensive and require large-scale training data. Here we propose an effective method for human action recognition using depth sequences captured by depth cameras. A multi-resolution operation, the spatial Laplacian and temporal energy pyramid (SLTEP), decomposes the depth sequences into frequency bands localized in space and time. A spatial aggregation and fusion scheme clusters the low-level features and concatenates two different feature types extracted from the low- and high-frequency levels, respectively. We evaluate our approach on five public benchmark datasets (MSRAction3D, MSRGesture3D, MSRActionPairs, MSRDailyActivity3D, and NTU RGB+D) and demonstrate its advantages over existing methods, as well as its suitability for online applications.

Introduction

Human action recognition has been of longstanding interest to the computer vision community due to its widespread potential in real-world applications, for example human–computer interaction [1] and intelligent video surveillance [2]. Conventional human action recognition studies mainly focus on recognizing actions from videos captured by color cameras, and these approaches still struggle with object segmentation and with illumination and texture variation. The emergence of inexpensive depth sensors such as Kinect [3], which capture depth information in real time, has captured the imagination of many researchers, including those studying action recognition. Compared to conventional color cameras, depth maps make subtracting the foreground from cluttered backgrounds much easier; furthermore, depth sensors are insensitive to illumination variation, facilitating applications in dark environments. Depth sensors also provide rich geometric information, which facilitates spatial information extraction and target recognition.

An action or gesture recognition system should work independently of the actor's identity, the speed of the performance, and the inherent variance among realizations of action instances [4], whilst ensuring fast classification, especially for online action recognition. Various detection and representation methods have been proposed to acquire accurate similarity measurements and improve action recognition performance using depth sequences, e.g., cloud points [5], skeleton joints [6], [7], and hyper-surface normals [8]. Skeleton joints can be accessed in real time by depth sensors and are computationally efficient for human action recognition, but less so for other applications such as gesture recognition and human–object interactions. Cloud points and hyper-surface normals use raw depth maps as inputs and are more robust to noise and occlusion [9], but are resource intensive. In this paper, we focus on improving action recognition performance using depth maps at low computational cost.

Our approach extracts features at different frequencies, obtained by decomposing the depth sequences with the spatial Laplacian and temporal energy pyramid (SLTEP). For the low-frequency component, 4D hyper-surface normals are extracted to capture spatial orientation cues. To retain the correlation between neighboring normals and make them more robust to noise, a local spatial neighborhood is introduced to cluster these low-level normal features into polynormal vectors. The coefficient-weighted differences between polynormals and visual words learnt by sparse coding are then computed to capture additional distribution information about the low-level features, and these weighted differences are aggregated by average pooling over each spatial grid. For the high-frequency component, we apply maximum pooling over silhouettes within the segments of the temporal energy pyramid to achieve invariance to the temporal location of motion. Then, histogram of oriented gradient (HOG) features (introduced in [10]) are extracted from the spatial grids to record body-shape information. Finally, the two feature types are fused to form the representation of the depth sequence.
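
For intuition, viewing the depth sequence as a hyper-surface z = f(x, y, t), a 4D normal is proportional to (-∂z/∂x, -∂z/∂y, -∂z/∂t, 1), as in HON4D [8]. The following is a minimal sketch using finite differences; the function name and the plain `np.gradient` derivatives are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def hyper_surface_normals(depth_seq):
    """4D surface normals for a depth sequence z = f(x, y, t).

    depth_seq: (T, H, W) array of depth frames.
    Returns a (T, H, W, 4) array of unit normals
    proportional to (-dz/dx, -dz/dy, -dz/dt, 1).
    """
    z = depth_seq.astype(np.float32)
    # Finite-difference derivatives along the t, y, and x axes.
    dz_dt, dz_dy, dz_dx = np.gradient(z)
    ones = np.ones_like(z)
    n = np.stack([-dz_dx, -dz_dy, -dz_dt, ones], axis=-1)
    # The constant last component keeps the norm strictly positive.
    return n / np.linalg.norm(n, axis=-1, keepdims=True)
```

Neighboring normals within a local spatial neighborhood would then be grouped into polynormal vectors before encoding, as described above.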

The main contributions of this paper can be summarized as follows: first, by employing spatial Laplacian and temporal energy pyramids, we decompose depth sequences to extract complementary features at different frequencies; second, local sparse representation is employed to obtain extra spatial distribution information from low-level features in local neighborhoods; and third, we propose a spatial aggregation and fusion scheme to cluster the low-level features and concatenate the extracted complementary features into the final representation. The experimental results demonstrate the advantages of the proposed method with respect to recognition accuracy and computational efficiency.
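
To make the second and third contributions concrete, the sketch below shows one plausible reading: each polynormal is sparse-coded against a learned dictionary of visual words, coefficient-weighted residuals are formed (a VLAD-like encoding), and the encodings are average-pooled per spatial grid cell before concatenation. The dictionary, grid assignment, and `alpha` are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np
from sklearn.decomposition import sparse_encode

def encode_polynormals(polynormals, dictionary, alpha=0.1):
    """Coefficient-weighted differences between polynormals and visual words.

    polynormals: (n, d) array, one flattened polynormal per row.
    dictionary:  (k, d) array of visual words learned by sparse coding.
    Returns an (n, k*d) array: for each polynormal, its residuals to all
    k visual words, each weighted by the corresponding sparse coefficient.
    """
    codes = sparse_encode(polynormals, dictionary, alpha=alpha)   # (n, k)
    diffs = polynormals[:, None, :] - dictionary[None, :, :]      # (n, k, d)
    weighted = codes[:, :, None] * diffs                          # (n, k, d)
    return weighted.reshape(len(polynormals), -1)

def pool_over_grids(encodings, grid_ids, n_grids):
    """Average-pool encodings within each spatial grid cell, then concatenate."""
    pooled = np.zeros((n_grids, encodings.shape[1]), dtype=encodings.dtype)
    for g in range(n_grids):
        members = encodings[grid_ids == g]
        if len(members):
            pooled[g] = members.mean(axis=0)
    return pooled.ravel()
```

The dictionary itself could come from standard sparse dictionary learning (e.g., `sklearn.decomposition.DictionaryLearning`) fit on training polynormals.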

The remainder of the paper is organized as follows. In Section 2, we review previous work related to action recognition. Section 3 describes our feature extraction and representation framework. In Section 4, we present experiments performed on five public benchmark datasets, the results of which demonstrate the effectiveness of our proposed approach compared to other published results. Finally, we conclude and discuss possibilities for future work in Section 5.

Section snippets

Related work

Space-time-based approaches are widely used in human action recognition from traditional color image sequences. These approaches rely on the detection and representation of space-time volumes. Laptev et al. [11] extracted features at multiple spatio-temporal scales to learn realistic human actions from movies. Based on the Laplacian pyramid, Shao et al. [12] decomposed videos into a series of sub-band feature 3D volumes and presented a novel descriptor, called spatio-temporal Laplacian pyramid
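
As background, a Laplacian pyramid stores the band-pass difference between each level of a Gaussian pyramid and the upsampled version of the next coarser level, plus the final low-pass residual. A minimal per-frame sketch with OpenCV follows; it illustrates the standard construction, not the code of [12] or of this paper.

```python
import cv2
import numpy as np

def laplacian_pyramid(frame, levels=3):
    """Decompose one depth frame into `levels` band-pass images
    plus the residual low-pass level."""
    current = frame.astype(np.float32)
    pyramid = []
    for _ in range(levels):
        down = cv2.pyrDown(current)                          # blur + downsample
        up = cv2.pyrUp(down, dstsize=current.shape[1::-1])   # upsample back
        pyramid.append(current - up)                         # band-pass difference
        current = down
    pyramid.append(current)                                  # low-pass residual
    return pyramid
```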

Overview of the framework

A schematic of the proposed framework is shown in Fig. 1. Depth sequence processing is divided into two parts by spatial Laplacian and temporal energy pyramids. Low-level features, i.e., 4D normals and silhouettes, are respectively extracted from two different frequency components and then compressed by temporal pooling to produce compact representations. Local sparse representation is employed to obtain extra spatial distribution information from local neighborhoods. Then, we propose a spatial
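
One plausible reading of the temporal branch is sketched below: per-frame motion energy is accumulated, each pyramid level splits the sequence into segments carrying roughly equal shares of that energy, and the frames inside a segment are max-pooled. The energy measure and the level structure (1, 2, 4 segments) are assumptions for illustration, not the authors' exact formulation.

```python
import numpy as np

def motion_energy(depth_seq):
    """Per-frame motion energy: summed absolute difference between
    consecutive depth frames (the first frame gets zero energy)."""
    diffs = np.abs(np.diff(depth_seq.astype(np.float32), axis=0))
    return np.concatenate([[0.0], diffs.sum(axis=(1, 2))])

def energy_pyramid_pooling(depth_seq, levels=(1, 2, 4)):
    """Split the sequence so each segment holds an equal share of the
    cumulative motion energy, then max-pool the frames in each segment."""
    energy = np.cumsum(motion_energy(depth_seq))
    energy /= max(energy[-1], 1e-8)              # normalize to [0, 1]
    pooled = []
    for n_seg in levels:
        bounds = np.searchsorted(energy, np.linspace(0, 1, n_seg + 1))
        bounds[0], bounds[-1] = 0, len(depth_seq)
        for s, e in zip(bounds[:-1], bounds[1:]):
            seg = depth_seq[s:max(e, s + 1)]     # guard against empty segments
            pooled.append(seg.max(axis=0))       # max-pooled 2D map
    return pooled
```

HOG descriptors could then be computed over spatial grids of each pooled map, e.g. with `skimage.feature.hog`.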

Experiments and discussion

To evaluate the performance of our approach, we conduct experiments on five public benchmark datasets: MSRAction3D [5], MSRGesture3D [41], MSRActionPairs [8], MSRDailyActivity3D [6], and the recent NTU RGB+D dataset [29]. Sample frames from these datasets are shown in Fig. 5. We compare our algorithm with several state-of-the-art approaches for human action recognition from depth sequences. For fair comparison, the multimodal methods, e.g., color videos with depth sequences and depth sequences

Conclusions and future work

In this paper, we present an efficient approach for human action recognition using depth sequences. The depth sequences are decomposed into certain frequency bands by spatial Laplacian and temporal energy pyramids. 4D hyper-surface normals and HOG features are extracted from the low-frequency (blurred) and high-frequency (difference) levels, respectively. To suppress outliers and obtain a compact representation of a sequence, maximal pooling is employed over the temporal segments. In addition, local sparse representations are

Acknowledgments

The research was supported by the National Natural Science Foundation of China (Nos. 6140021567 and 6140051238), the Natural Science Foundation of Guangdong Province (No. 2015A030313744), the Special Program of Guangdong Frontier and Key Technological Innovation (2016B010108010), the Guangdong Technology Project (2016B010108010), the Key Laboratory of Human-Machine Intelligence-Synergy Systems, Chinese Academy of Sciences (2014DP173025), and the Shenzhen Technology Project (JSGG20160331185256983).

References (41)

  • Z. Liu et al., 3D-based deep convolutional neural network for action recognition with depth sequences, Image Vision Comput. (2016)
  • D. Cireşan et al., Multi-column deep neural network for traffic sign classification, Neural Netw. (2012)
  • J. Yang et al., A structure optimization framework for feed-forward neural networks using sparse representation, Knowl.-Based Syst. (2016)
  • F. Moayedi et al., Structured sparse representation for human action recognition, Neurocomputing (2015)
  • F. Cao et al., Pose and illumination variable face recognition via sparse representation and illumination dictionary, Knowl.-Based Syst. (2016)
  • S. Chen et al., Discriminative local collaborative representation for online object tracking, Knowl.-Based Syst. (2016)
  • J. Shotton et al., Real-time human pose recognition in parts from single depth images
  • W. Li et al., Action recognition based on a bag of 3D points
  • J. Wang et al., Mining actionlet ensemble for action recognition with depth cameras
  • J. Wang et al., Learning actionlet ensemble for 3D human action recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2014)