Neurocomputing

Volume 149, Part A, 3 February 2015, Pages 79-85

RGB-D action recognition using linear coding

https://doi.org/10.1016/j.neucom.2013.12.061

Abstract

In this paper, we investigate action recognition using an inexpensive RGB-D sensor (Microsoft Kinect). First, a depth spatial-temporal descriptor is developed to extract local regions of interest from depth images. Such descriptors are very robust to illumination changes and background clutter. Then the intensity spatial-temporal descriptor and the depth spatial-temporal descriptor are combined and fed into a linear coding framework to obtain an effective feature vector, which can be used for action classification. Finally, extensive experiments are conducted on a publicly available RGB-D action recognition dataset and the proposed method shows promising results.

Introduction

Recognition of human actions has been an active research topic in computer vision. In the past decade, research has mainly focused on learning and recognizing actions from video sequences captured by a single camera, and a rich literature can be found in a wide range of fields including computer vision, pattern recognition, machine learning and signal processing. Recently, several approaches have used local spatio-temporal descriptors together with the bag-of-words model to represent actions. Since these approaches do not rely on any preprocessing techniques, e.g. foreground detection or body-part tracking, they are relatively robust to changes in viewpoint, noise, background, and illumination. However, most existing work on action recognition is based on color video alone, which leads to relatively low accuracy even when there is no clutter.

Different from these works, our motivation is driven by the application of the famous mass-produced consumer electronics device Kinect, which provides a depth stream and a color stream. Kinect has been applied in many fields, including people detection and tracking [1], [2]. Currently there exist very few works that utilize the color-depth sensor combination for human action recognition. For example, Ref. [3] used the depth information but totally ignored the color information. In fact, as we will analyze, the color information and depth information can be complementary, since human actions are in essence three-dimensional. However, how to effectively fuse the color and depth information remains a challenging problem. In this paper, we extract local descriptors from the color and depth video and utilize a linear coding framework to integrate the color and depth information. The main contributions are summarized as follows:

  1. The conventional STIP descriptor is extended by incorporating depth information to deal with depth video. Such descriptors are very robust to illumination changes and background clutter.

  2. A linear coding framework is developed to fuse the intensity spatial-temporal descriptor and the depth spatial-temporal descriptor to form a robust feature vector. In addition, we further exploit the temporal characteristics of the video sequence and design a new pooling technique to improve the description performance.

  3. Extensive experiments are conducted on a publicly available RGB-D action recognition dataset and the proposed method shows promising results.

The organization of this paper is as follows: Section 2 introduces the feature extraction. Sections 3 and 4 present the coding and pooling methods, respectively. The experimental results are given in Section 5. Finally, Section 6 gives some conclusions.

Section snippets

Feature extraction

There are several schemes applied to time-consistent scene recognition problems. Some of them are statistics-based approaches, such as Hidden Markov Models and the Latent-Dynamic Discriminative Model [4]. In contrast, the Space-Time Interest Points (STIP) approach [5] treats the temporal axis in the same way as the spatial axes and looks for features along the temporal axis as well. We prefer the latter because the time parameter of the sample is essentially the same as the space parameters in…
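To make the detector concrete, below is a minimal sketch of a Harris3D-style space-time interest point detector in the spirit of Laptev's STIP [5]; it applies equally to an intensity or a depth volume. The smoothing scales, the constant k, and the toy input are illustrative assumptions, not the paper's settings.

```python
# Hedged sketch of Harris3D-style STIP detection (assumed parameters).
import numpy as np
from scipy import ndimage

def stip_response(video, sigma=2.0, tau=1.5, s=2.0, k=0.005):
    """Harris3D response for a (T, H, W) grayscale video volume."""
    # Scale-space smoothing: spatial scale sigma, temporal scale tau.
    L = ndimage.gaussian_filter(video.astype(np.float64), (tau, sigma, sigma))
    Lt, Ly, Lx = np.gradient(L)
    # Entries of the second-moment matrix, integrated at scales s*(tau, sigma).
    w = (s * tau, s * sigma, s * sigma)
    Mxx = ndimage.gaussian_filter(Lx * Lx, w)
    Myy = ndimage.gaussian_filter(Ly * Ly, w)
    Mtt = ndimage.gaussian_filter(Lt * Lt, w)
    Mxy = ndimage.gaussian_filter(Lx * Ly, w)
    Mxt = ndimage.gaussian_filter(Lx * Lt, w)
    Myt = ndimage.gaussian_filter(Ly * Lt, w)
    # H = det(M) - k * trace(M)^3, the 2-D Harris measure extended to 3-D.
    det = (Mxx * (Myy * Mtt - Myt ** 2)
           - Mxy * (Mxy * Mtt - Myt * Mxt)
           + Mxt * (Mxy * Myt - Myy * Mxt))
    return det - k * (Mxx + Myy + Mtt) ** 3

def detect_stips(video, n_points=200):
    """Return (t, y, x) coordinates of the strongest local maxima."""
    H = stip_response(video)
    local_max = (H == ndimage.maximum_filter(H, size=5)) & (H > 0)
    coords = np.argwhere(local_max)
    order = np.argsort(H[local_max])[::-1]
    return coords[order[:n_points]]

# Toy usage: a random volume standing in for a depth or intensity clip.
points = detect_stips(np.random.rand(30, 120, 160))
```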

Coding approaches

A popular method for coding is the vector quantization (VQ) method, which solves the following constrained least-squares fitting problem:

$$\min_{C}\ \sum_{i=1}^{M} \| x_i - B c_i \|_2^2 \quad \text{s.t.}\quad \|c_i\|_0 = 1,\ \|c_i\|_1 = 1,\ c_i \succeq 0,\ \forall i,$$

where $C=[c_1, c_2, \ldots, c_M]$ is the set of codes for $X=[x_1, x_2, \ldots, x_M]$. The cardinality constraint $\|c_i\|_0 = 1$ means that there will be only one non-zero element in each code $c_i$, corresponding to the quantization id of $x_i$. The non-negativity and $\ell_1$ constraints $\|c_i\|_1 = 1$, $c_i \succeq 0$ mean that the coding weight for $x_i$ is 1. In practice, the…
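Under the cardinality and unit-weight constraints above, the VQ problem is solved exactly by assigning each descriptor to its nearest codeword. The sketch below shows this step; the array shapes and toy data are assumptions for illustration.

```python
# Hedged sketch of the VQ coding step defined by the problem above.
import numpy as np

def vq_encode(X, B):
    """X: (d, M) descriptors, B: (d, K) codebook -> C: (K, M) one-hot codes."""
    # Squared Euclidean distance between every descriptor and codeword.
    d2 = (np.sum(B ** 2, axis=0)[:, None]      # (K, 1)
          + np.sum(X ** 2, axis=0)[None, :]    # (1, M)
          - 2.0 * B.T @ X)                     # (K, M)
    nearest = np.argmin(d2, axis=0)            # quantization id of each x_i
    C = np.zeros((B.shape[1], X.shape[1]))
    C[nearest, np.arange(X.shape[1])] = 1.0    # ||c_i||_0 = ||c_i||_1 = 1
    return C

# Toy usage: 64-dim descriptors quantized against 256 visual words.
rng = np.random.default_rng(0)
codes = vq_encode(rng.standard_normal((64, 500)), rng.standard_normal((64, 256)))
```

LLC, discussed in the next section, relaxes this hard assignment to a locality-constrained linear combination of a few nearest codewords, which reduces quantization error while keeping the codes sparse.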

Pooling strategy

Similar to the VQ coding approach, the LLC coding coefficients $c_i$ are expected to be combined into a global representation of the sample for classification. In early work on VQ and LLC, the SPM framework [12] was frequently used for pooling coding coefficients. In SPM, the image is first subdivided at several different levels of resolution; then, for each level of resolution, the coding coefficients that fall in each spatial bin are summed; finally, all the spatial histograms are weighted…
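For reference, below is a minimal sketch of standard SPM-style sum pooling of coding coefficients [12], the baseline this section builds on; it is not the paper's new temporal pooling. The three-level pyramid, the 1/4, 1/4, 1/2 level weights, and positions normalized to [0, 1) are common conventions assumed here.

```python
# Hedged sketch of SPM sum pooling over coding coefficients (assumed setup).
import numpy as np

def spm_pool(C, xy, levels=(1, 2, 4)):
    """C: (K, M) codes, xy: (M, 2) positions in [0, 1) -> pooled vector."""
    K, M = C.shape
    L = len(levels) - 1
    pooled = []
    for lvl, cells in enumerate(levels):
        # Spatial bin index of each descriptor at this pyramid level.
        bins = np.clip((xy * cells).astype(int), 0, cells - 1)
        idx = bins[:, 1] * cells + bins[:, 0]
        hist = np.zeros((K, cells * cells))
        np.add.at(hist, (slice(None), idx), C)  # sum the codes in each bin
        # Standard SPM weights: 1/2^L at level 0, 1/2^(L - l + 1) otherwise.
        weight = 2.0 ** (-L) if lvl == 0 else 2.0 ** (lvl - L - 1)
        pooled.append(weight * hist.ravel())
    return np.concatenate(pooled)

# Toy usage: pool 256-word codes for 500 descriptors at random positions.
rng = np.random.default_rng(0)
vec = spm_pool(rng.random((256, 500)), rng.random((500, 2)))
```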

Experimental results

In this section, we first introduce the details of the utilized dataset and the evaluated methods. Then, we show the extensive experimental results in the second part.

Conclusion

In this paper, we perform action recognition using an inexpensive RGB-D sensor. A depth spatial-temporal descriptor is developed to extract local regions of interest from depth images. Such descriptors are very robust to illumination changes and background clutter. Further, the intensity spatial-temporal descriptor and the depth spatial-temporal descriptor are combined in the linear coding framework and an effective feature vector can be constructed for action classification. Finally, extensive…

Acknowledgement

This work was supported by the National Key Project for Basic Research of China (Grant no. 2013CB329403), the National Natural Science Foundation of China (Grant nos. 61075027, 91120011 and 61210013), the Tsinghua Self-innovation Project (Grant no. 20111081111), and in part by the Tsinghua University Initiative Scientific Research Program (Grant no. 20131089295).


References (14)

  • B. Ni, G. Wang, P. Moulin, RGBD-HuDaAct: a color-depth video database for human daily activity recognition, in: IEEE...
  • J. Sung, C. Ponce, B. Selman, A. Saxena, Unstructured human activity detection from RGBD images, in: IEEE International...
  • W. Li, Z. Zhang, Z. Liu, Action recognition based on a bag of 3d points, in: IEEE Conference on Computer Vision and...
  • L. Morency, A. Quattoni, T. Darrell, Latent-dynamic discriminative models for continuous gesture recognition, in: IEEE...
  • I. Laptev, On space-time interest points, Int. J. Comput. Vis. (2005)
  • N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: IEEE Conference on Computer Vision and...
  • N. Dalal, B. Triggs, C. Schmid, Human detection using oriented histograms of flow and appearance, in: European...

Cited by (16)

  • A two-level attention-based interaction model for multi-person activity recognition

    2018, Neurocomputing
    Citation excerpt:

    A large number of works have concentrated on action recognition in RGB or RGB-D data [29–31]. Liu et al. [29] investigate action recognition using an inexpensive RGB-D sensor. Here, we note that actions of persons can be described by the evolutions of a series of human poses.

  • Rank pooling dynamic network: Learning end-to-end dynamic characteristic for action recognition

    2018, Neurocomputing
    Citation excerpt:

    Yet how to design a segment-level consensus function remains an open problem. The hand-crafted methods, such as HOG, MBH [38,40] and RGB-D [39], can be utilized to encode the convolutional features, and various consensus functions can lead to significant differences in the accuracy of action recognition. Motivated by 3D CNN and the de-coupled idea, there are many variants, e.g. Two-stream 3D CNN [17,18], Two-stream CNN + 3DCNN [13], to capture the long-range temporal property.

  • The spatial Laplacian and temporal energy pyramid representation for human action recognition using depth sequences

    2017, Knowledge-Based Systems
    Citation excerpt:

    By using the multiple kernel learning (MKL) technique, Althloothi et al. [24] fused shape features extracted from the frequency domain and human joint positions at the kernel level for human activity recognition. Liu et al. [25] combined the intensity spatial-temporal descriptor and the depth spatial-temporal descriptor and proposed a linear coding framework to obtain an effective feature vector for action classification. By fusing sequential RGB and depth information, Liu et al. [26] proposed coupled hidden conditional random fields (cHCRF) to learn sequence-specific and sequence-shared temporal structures.


Huaping Liu received the Ph.D. degree from the Department of Computer Science and Technology, Tsinghua University, Beijing, China, in 2004. He is currently an Associate Professor in the Department of Computer Science and Technology at Tsinghua University. His research interests include intelligent control and robotics.

Mingyi Yuan received the Bachelor's degree from the Department of Physics at Peking University in 2007, and the Master's degree from the Department of Computer Science and Technology at Tsinghua University in 2013. He is now with the Microsoft Asia-Pacific R&D Group. His research interests include computer vision and machine learning.

Fuchun Sun received the Ph.D. degree from the Department of Computer Science and Technology, Tsinghua University, Beijing, China, in 1998. He is now a full professor in this department. He serves as an associate editor of IEEE Transactions on Fuzzy Systems and Mechatronics, and as a member of the Editorial Boards of the International Journal of Robotics and Autonomous Systems, the International Journal of Control, Automation, and Systems, Science in China Series F: Information Science, and Acta Automatica Sinica. His research interests include intelligent control, neural networks, fuzzy systems, and robot teleoperation.
