Neurocomputing

Volume 149, Part A, 3 February 2015, Pages 79-85

RGB-D action recognition using linear coding

https://doi.org/10.1016/j.neucom.2013.12.061

Abstract

In this paper, we investigate action recognition using an inexpensive RGB-D sensor (Microsoft Kinect). First, a depth spatial-temporal descriptor is developed to extract local regions of interest from depth images. Such descriptors are very robust to illumination changes and background clutter. Then the intensity spatial-temporal descriptor and the depth spatial-temporal descriptor are combined and fed into a linear coding framework to obtain an effective feature vector, which can be used for action classification. Finally, extensive experiments are conducted on a publicly available RGB-D action recognition dataset and the proposed method shows promising results.

Introduction

Recognition of human actions has been an active research topic in computer vision. In the past decade, research has mainly focused on learning and recognizing actions from video sequences captured by a single camera, and a rich literature can be found in a wide range of fields including computer vision, pattern recognition, machine learning and signal processing. Recently, several approaches have used local spatio-temporal descriptors together with the bag-of-words model to represent actions. Since these approaches do not rely on any preprocessing techniques, e.g. foreground detection or body-part tracking, they are relatively robust to changes in viewpoint, noise, background, and illumination. However, most existing work on action recognition is based on color video alone, which leads to relatively low accuracy even when there is no clutter.

Different from these works, our motivation is driven by the application of the famous mass-produced consumer electronics device Kinect, which provides a depth stream and a color stream. Kinect has been applied in many fields, including people detection and tracking [1], [2]. Currently there exist very few works that utilize the color-depth sensor combination for human action recognition. For example, Ref. [3] used the depth information but totally ignored the color information. In fact, as we will analyze, the color information and depth information can be complementary, since human actions are in essence three-dimensional. However, how to effectively fuse the color and depth information remains a challenging problem. In this paper, we extract local descriptors from the color and depth video and utilize a linear coding framework to integrate the color and depth information. The main contributions are summarized as follows:

  1. The conventional STIP descriptor is extended by incorporating depth information to deal with depth video. Such descriptors are very robust to illumination changes and background clutter.

  2. A linear coding framework is developed to fuse the intensity spatial-temporal descriptor and the depth spatial-temporal descriptor to form a robust feature vector. In addition, we further exploit the temporal characteristics of the video sequence and design a new pooling technique to improve the description performance.

  3. Extensive experiments are conducted on a publicly available RGB-D action recognition dataset and the proposed method shows promising results.

The organization of this paper is as follows: Section 2 introduces the feature extraction. Sections 3 and 4 present the coding and pooling methods, respectively. The experimental results are given in Section 5. Finally, Section 6 gives some conclusions.

Section snippets

Feature extraction

There are several schemes applied to time-consistent scene recognition problems. Some of them are statistics-based approaches, such as Hidden Markov Models and the Latent-Dynamic Discriminative Model [4]. In contrast, the Space-Time Interest Points (STIP) approach [5] treats the temporal axis in the same way as the spatial axes and looks for features along the temporal axis as well. We prefer the latter because the time parameter of the sample is essentially the same as the space parameters in…
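To make the detector concrete, below is a minimal sketch of a Harris3D-style space-time interest point detector in the spirit of Laptev's STIP [5]; it applies equally to an intensity or a depth volume. The smoothing scales, the constant k, and the toy input are illustrative assumptions, not the paper's settings.

```python
# Hedged sketch of Harris3D-style STIP detection (assumed parameters).
import numpy as np
from scipy import ndimage

def stip_response(video, sigma=2.0, tau=1.5, s=2.0, k=0.005):
    """Harris3D response for a (T, H, W) grayscale video volume."""
    # Scale-space smoothing: spatial scale sigma, temporal scale tau.
    L = ndimage.gaussian_filter(video.astype(np.float64), (tau, sigma, sigma))
    Lt, Ly, Lx = np.gradient(L)
    # Entries of the second-moment matrix, integrated at scales s*(tau, sigma).
    w = (s * tau, s * sigma, s * sigma)
    Mxx = ndimage.gaussian_filter(Lx * Lx, w)
    Myy = ndimage.gaussian_filter(Ly * Ly, w)
    Mtt = ndimage.gaussian_filter(Lt * Lt, w)
    Mxy = ndimage.gaussian_filter(Lx * Ly, w)
    Mxt = ndimage.gaussian_filter(Lx * Lt, w)
    Myt = ndimage.gaussian_filter(Ly * Lt, w)
    # H = det(M) - k * trace(M)^3, the 2-D Harris measure extended to 3-D.
    det = (Mxx * (Myy * Mtt - Myt ** 2)
           - Mxy * (Mxy * Mtt - Myt * Mxt)
           + Mxt * (Mxy * Myt - Myy * Mxt))
    return det - k * (Mxx + Myy + Mtt) ** 3

def detect_stips(video, n_points=200):
    """Return (t, y, x) coordinates of the strongest local maxima."""
    H = stip_response(video)
    local_max = (H == ndimage.maximum_filter(H, size=5)) & (H > 0)
    coords = np.argwhere(local_max)
    order = np.argsort(H[local_max])[::-1]
    return coords[order[:n_points]]

# Toy usage: a random volume standing in for a depth or intensity clip.
points = detect_stips(np.random.rand(30, 120, 160))
```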

Coding approaches

A popular method for coding is the vector quantization (VQ) method, which solves the following constrained least-squares fitting problem:

$$\min_{C}\ \sum_{i=1}^{M} \| x_i - B c_i \|_2^2 \quad \text{s.t.}\quad \|c_i\|_0 = 1,\ \|c_i\|_1 = 1,\ c_i \succeq 0,\ \forall i,$$

where $C=[c_1, c_2, \ldots, c_M]$ is the set of codes for $X=[x_1, x_2, \ldots, x_M]$. The cardinality constraint $\|c_i\|_0 = 1$ means that there will be only one non-zero element in each code $c_i$, corresponding to the quantization id of $x_i$. The non-negativity and $\ell_1$ constraints $\|c_i\|_1 = 1$, $c_i \succeq 0$ mean that the coding weight for $x_i$ is 1. In practice, the…
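Under the cardinality and unit-weight constraints above, the VQ problem is solved exactly by assigning each descriptor to its nearest codeword. The sketch below shows this step; the array shapes and toy data are assumptions for illustration.

```python
# Hedged sketch of the VQ coding step defined by the problem above.
import numpy as np

def vq_encode(X, B):
    """X: (d, M) descriptors, B: (d, K) codebook -> C: (K, M) one-hot codes."""
    # Squared Euclidean distance between every descriptor and codeword.
    d2 = (np.sum(B ** 2, axis=0)[:, None]      # (K, 1)
          + np.sum(X ** 2, axis=0)[None, :]    # (1, M)
          - 2.0 * B.T @ X)                     # (K, M)
    nearest = np.argmin(d2, axis=0)            # quantization id of each x_i
    C = np.zeros((B.shape[1], X.shape[1]))
    C[nearest, np.arange(X.shape[1])] = 1.0    # ||c_i||_0 = ||c_i||_1 = 1
    return C

# Toy usage: 64-dim descriptors quantized against 256 visual words.
rng = np.random.default_rng(0)
codes = vq_encode(rng.standard_normal((64, 500)), rng.standard_normal((64, 256)))
```

LLC, discussed in the next section, relaxes this hard assignment to a locality-constrained linear combination of a few nearest codewords, which reduces quantization error while keeping the codes sparse.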

Pooling strategy

Similar to the VQ coding approach, the LLC coding coefficients $c_i$ are expected to be combined into a global representation of the sample for classification. In early work on VQ and LLC, the SPM framework [12] was frequently used for pooling coding coefficients. In SPM, the image is first subdivided at several different levels of resolution; then, for each level of resolution, the coding coefficients that fall in each spatial bin are summed; finally, all the spatial histograms are weighted…
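For reference, below is a minimal sketch of standard SPM-style sum pooling of coding coefficients [12], the baseline this section builds on; it is not the paper's new temporal pooling. The three-level pyramid, the 1/4, 1/4, 1/2 level weights, and positions normalized to [0, 1) are common conventions assumed here.

```python
# Hedged sketch of SPM sum pooling over coding coefficients (assumed setup).
import numpy as np

def spm_pool(C, xy, levels=(1, 2, 4)):
    """C: (K, M) codes, xy: (M, 2) positions in [0, 1) -> pooled vector."""
    K, M = C.shape
    L = len(levels) - 1
    pooled = []
    for lvl, cells in enumerate(levels):
        # Spatial bin index of each descriptor at this pyramid level.
        bins = np.clip((xy * cells).astype(int), 0, cells - 1)
        idx = bins[:, 1] * cells + bins[:, 0]
        hist = np.zeros((K, cells * cells))
        np.add.at(hist, (slice(None), idx), C)  # sum the codes in each bin
        # Standard SPM weights: 1/2^L at level 0, 1/2^(L - l + 1) otherwise.
        weight = 2.0 ** (-L) if lvl == 0 else 2.0 ** (lvl - L - 1)
        pooled.append(weight * hist.ravel())
    return np.concatenate(pooled)

# Toy usage: pool 256-word codes for 500 descriptors at random positions.
rng = np.random.default_rng(0)
vec = spm_pool(rng.random((256, 500)), rng.random((500, 2)))
```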

Experimental results

In this section, we first introduce the details of the utilized dataset and the evaluated methods. Then, we show the extensive experimental results in the second part.

Conclusion

In this paper, we perform action recognition using an inexpensive RGB-D sensor. A depth spatial-temporal descriptor is developed to extract local regions of interest from depth images. Such descriptors are very robust to illumination changes and background clutter. Further, the intensity spatial-temporal descriptor and the depth spatial-temporal descriptor are combined in the linear coding framework and an effective feature vector can be constructed for action classification. Finally, extensive…

Acknowledgement

This work was supported by the National Key Project for Basic Research of China (Grant no. 2013CB329403), the National Natural Science Foundation of China (Grant nos. 61075027, 91120011 and 61210013), the Tsinghua Self-innovation Project (Grant no. 20111081111), and in part by the Tsinghua University Initiative Scientific Research Program (Grant no. 20131089295).


References (14)

  • B. Ni, G. Wang, P. Moulin, RGBD-HuDaAct: a color-depth video database for human daily activity recognition, in: IEEE...
  • J. Sung, C. Ponce, B. Selman, A. Saxena, Unstructured human activity detection from RGBD images, in: IEEE International...
  • W. Li, Z. Zhang, Z. Liu, Action recognition based on a bag of 3d points, in: IEEE Conference on Computer Vision and...
  • L. Morency, A. Quattoni, T. Darrell, Latent-dynamic discriminative models for continuous gesture recognition, in: IEEE...
  • I. Laptev, On space-time interest points, Int. J. Comput. Vis. (2005)
  • N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: IEEE Conference on Computer Vision and...
  • N. Dalal, B. Triggs, C. Schmid, Human detection using oriented histograms of flow and appearance, in: European...

Cited by (16)

  • A two-level attention-based interaction model for multi-person activity recognition

    2018, Neurocomputing
    Citation excerpt:

    A large number of works have concentrated on action recognition in RGB or RGB-D data [29–31]. Liu et al. [29] investigate action recognition using an inexpensive RGB-D sensor. Here, we note that actions of persons can be described by the evolutions of a series of human poses.

  • Rank pooling dynamic network: Learning end-to-end dynamic characteristic for action recognition

    2018, Neurocomputing
    Citation excerpt:

    Yet how to design a segment-level consensus function remains an open problem. The hand-crafted methods, such as HOG, MBH [38,40] and RGB-D [39], can be utilized to encode the convolutional features, and various consensus functions can lead to significant differences in the accuracy of action recognition. Motivated by 3D CNN and the de-coupled idea, there are many variants, e.g. Two-stream 3D CNN [17,18], Two-stream CNN + 3DCNN [13], to capture the long-range temporal property.

  • The spatial Laplacian and temporal energy pyramid representation for human action recognition using depth sequences

    2017, Knowledge-Based Systems
    Citation excerpt:

    By using the multiple kernel learning (MKL) technique, Althloothi et al. [24] fused shape features extracted from the frequency domain and human joint positions at the kernel level for human activity recognition. Liu et al. [25] combined the intensity spatial-temporal descriptor and the depth spatial-temporal descriptor and proposed a linear coding framework to obtain an effective feature vector for action classification. By fusing sequential RGB and depth information, Liu et al. [26] proposed coupled hidden conditional random fields (cHCRF) to learn sequence-specific and sequence-shared temporal structures.


Huaping Liu received the Ph.D. degree from the Department of Computer Science and Technology, Tsinghua University, Beijing, China, in 2004. He is currently an Associate Professor in the Department of Computer Science and Technology at Tsinghua University. His research interests include intelligent control and robotics.

Mingyi Yuan received the Bachelor's degree from the Department of Physics at Peking University in 2007, and the Master's degree from the Department of Computer Science and Technology at Tsinghua University in 2013. He is now with the Microsoft Asia-Pacific R&D Group. His research interests include computer vision and machine learning.

Fuchun Sun received the Ph.D. degree from the Department of Computer Science and Technology, Tsinghua University, Beijing, China, in 1998. He is now a full professor in this department. He serves as an associate editor of IEEE Transactions on Fuzzy Systems and Mechatronics, and as a member of the Editorial Boards of the International Journal of Robotics and Autonomous Systems, the International Journal of Control, Automation, and Systems, Science in China Series F: Information Science, and Acta Automatica Sinica. His research interests include intelligent control, neural networks, fuzzy systems, and robot teleoperation.
