Pattern Recognition Letters

Volume 72, 1 March 2016, Pages 62-71

Discriminative human action classification using locality-constrained linear coding

https://doi.org/10.1016/j.patrec.2015.07.015

Highlights

  • We propose using locality-constrained linear coding for action classification.

  • Our sequence descriptor includes cell, block, and subsequence descriptors.

  • We use maximum pooling and a logistic regression classifier to encode each sequence.

  • We demonstrate the effectiveness of our algorithm on both depth and RGB videos.

Abstract

We propose a Locality-constrained Linear Coding (LLC) based algorithm that captures discriminative information of human actions in spatio-temporal subsequences of videos. The input video is divided into equally spaced overlapping spatio-temporal subsequences. Each subsequence is further divided into blocks and then cells. The spatio-temporal information in each cell is represented by a Histogram of Oriented 3D Gradients (HOG3D). LLC is then used to encode each block. We show that LLC gives more stable and repetitive codes compared to the standard Sparse Coding. The final representation of a video sequence is obtained using logistic regression with ℓ2 regularization and classification is performed by a linear SVM. The proposed algorithm is applicable to conventional and depth videos. Experimental comparison with ten state-of-the-art methods on three depth video and two conventional video databases shows that the proposed method consistently achieves the best performance.
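To make the encoding and classification stage described above concrete, the following is a minimal scikit-learn sketch of one plausible reading of that pipeline: max-pool the per-block LLC codes of a sequence, form the final representation with ℓ2-regularized logistic regression, and classify with a linear SVM. All shapes, parameter values, and the random placeholder data are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

def max_pool(block_codes):
    """Element-wise maximum pooling of the LLC codes of all blocks
    in one video sequence into a single descriptor."""
    return block_codes.max(axis=0)

# Random placeholder data: 40 videos, 30 blocks each, 1024-atom dictionary.
rng = np.random.default_rng(0)
pooled = np.stack([max_pool(rng.random((30, 1024))) for _ in range(40)])
labels = rng.integers(0, 5, size=40)          # 5 hypothetical action classes

# l2-regularized logistic regression yields the final sequence
# representation (here taken as class-probability vectors) ...
logreg = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
representation = logreg.fit(pooled, labels).predict_proba(pooled)

# ... which a linear SVM then classifies.
svm = LinearSVC(C=1.0).fit(representation, labels)
print(svm.predict(representation[:5]))
```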

Introduction

Human action classification from videos is an important problem because of its many applications in video surveillance, human-computer interaction, sports analysis, and elderly health care [1], [33]. However, automatic classification of human actions in videos is a challenging problem. For colour videos, lighting conditions in the environment and clothing worn by the human subject can both affect the performance of the action classification algorithms. While these problems are eliminated in depth videos, issues about occlusion, loose clothing, and variations in style and execution speed of actions remain.

We propose a human action classification algorithm based on Locality-constrained Linear Coding (LLC). We demonstrate improved classification performance for actions captured by both depth and colour videos (Fig. 1). The authors of [32] compared LLC with Sparse Coding (SC) and found that LLC retains more essential information than SC for object classification. In this paper, and in our earlier work in this direction [18], we investigate the use of LLC for encoding human actions. We found that, in favouring sparsity, SC tends to select quite different elements from the action feature dictionary even for the same action. This is detrimental because it increases the intra-action-class variation. LLC, on the other hand, imposes locality constraints on the features and tends to select the same dictionary elements for the same action (see the coding sketch after the contribution list). Our research contributions are fourfold:

  • We propose using locality-constrained linear coding for human action classification.

  • We propose a sequence descriptor for each action video that is constructed hierarchically from cell, block, and subsequence descriptors.

  • We propose using maximum pooling and logistic regression with ℓ2 regularization for action classification.

  • We demonstrate that the proposed algorithm achieves higher action classification accuracy than existing techniques.
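To make the contrast between LLC and SC concrete (as noted above), here is a minimal NumPy sketch of the approximated LLC coding scheme of [32]: each local feature is reconstructed from its k nearest dictionary atoms under a sum-to-one constraint. Because the support of the code is fixed by a nearest-neighbour search, similar features tend to activate the same atoms, which is the stability property we exploit. The parameter values and names here are our illustrative choices, not the paper's settings.

```python
import numpy as np

def llc_code(x, B, k=5, beta=1e-4):
    """Approximated LLC coding of one local feature x against dictionary B.

    x : (d,) local feature (e.g., a block's HOG3D descriptor)
    B : (m, d) dictionary with m atoms as rows
    Returns an (m,) code that is non-zero only on the k nearest atoms.
    """
    # Locality constraint: restrict the coding to the k nearest atoms.
    nearest = np.argsort(np.linalg.norm(B - x, axis=1))[:k]
    Bk = B[nearest]                               # (k, d)

    # Solve min_c ||x - c^T Bk||^2  s.t.  sum(c) = 1  in closed form.
    z = Bk - x                                    # shift atoms to the origin
    C = z @ z.T                                   # local covariance matrix
    C += beta * np.trace(C) * np.eye(k)           # condition the linear system
    c = np.linalg.solve(C, np.ones(k))
    c /= c.sum()                                  # enforce the constraint

    code = np.zeros(B.shape[0])
    code[nearest] = c
    return code

# Two slightly perturbed features tend to select the same dictionary atoms,
# unlike SC, whose supports can differ even for near-identical inputs.
rng = np.random.default_rng(1)
B, x = rng.random((1024, 96)), rng.random(96)
print(np.nonzero(llc_code(x, B))[0])
print(np.nonzero(llc_code(x + 0.01 * rng.random(96), B))[0])
```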

To show that LLC is effective for human action classification, we design our sequence descriptor based on LLC and compare its classification accuracy with that of a descriptor computed using SC. Furthermore, we evaluate these sequence descriptors against ten state-of-the-art techniques. For the benchmark depth video datasets (MSRGesture3D, MSRAction3D, and MSRActionPairs3D), we compare our algorithm against the algorithms of

  • [11], where the local descriptor is based on the histogram of oriented 3D spatio-temporal gradients (or HOG3D);

  • [30], where the random occupancy patterns (ROP) are used;

  • [31], where an actionlet ensemble model is learned;

  • [17], where the histogram of oriented 4D normals (HON4D) features are used;

  • [35], where space time interest points from depth sequences (DSTIP) are used;

  • [20], where local features encoding variations of depths and depth gradients plus skeletal body joints are used with a random decision forest (RDF) classifier.

For the benchmark colour video datasets (Weizmann and UCFSports), we compare our algorithm with the algorithms of [11] and the following:

  • [38], where the mapping between densely-sampled feature patches and the votes in a spatio-temporal action Hough space is trained using random trees;

  • [42], where the HOG3D feature is computed within a small 3D cuboid centred at a space-time point and encoded in a sparse coding framework;

  • [13], where a figure-centric word representation is used;

  • [21], where spatio-temporal structures from clustering of point trajectories of body parts are used;

  • [28], where a deformable part model is generated for each action from a collection of examples, with actions being treated as spatio-temporal patterns in the colour videos.

In all of these experiments, our proposed algorithm has consistently shown improved performance compared to the existing algorithms.

Section snippets

Related work

Many approaches to human action classification involve analysing colour videos directly [5], [6], [28], [40]. Since the release of the Kinect camera in 2010, an increasing number of human action classification papers targeting depth videos, colour+depth videos, and/or skeletal data have emerged [10], [12], [15], [17], [19], [20], [30], [31], [35], [37].

Some action classification algorithms exploited silhouette and edge pixels to form discriminative …

Proposed algorithm

We define an action as a function operating on a three-dimensional space. The three independent variables in this space are (x, y, t) and the dependent variable is the depth d, i.e., d = H(x, y, t). In this setting, each action is characterized by the variations of the depth values along the three dimensions.
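As a concrete illustration of this definition, the sketch below evaluates the variation of d = H(x, y, t) along the three dimensions on a placeholder depth volume; these per-voxel spatio-temporal gradients are the quantities from which HOG3D cell descriptors are built. The volume shape and random data are illustrative assumptions only.

```python
import numpy as np

# A depth video as a function d = H(x, y, t): a random placeholder volume
# of 120 x 160 pixels over 50 frames stands in for a real depth sequence.
H = np.random.default_rng(0).random((120, 160, 50))

# Variation of the depth values along the three dimensions of the space.
dH_dx, dH_dy, dH_dt = np.gradient(H)

# Each voxel now carries a 3D spatio-temporal gradient vector; a HOG3D cell
# descriptor is a histogram of the orientations of these vectors.
grad = np.stack([dH_dx, dH_dy, dH_dt], axis=-1)   # shape (120, 160, 50, 3)
print(grad.shape)
```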

Experiments and results

We evaluate the action classification performance of the proposed algorithm on actions captured by both depth and RGB videos. For depth videos, we use three public datasets: MSRAction3D [15], [31], MSRGesture3D [12], and MSRActionPairs3D [17]. For colour data, we use two public datasets: Weizmann [2] and UCFSports [22].

The performance of our proposed algorithm is compared with ten state-of-the-art algorithms including [11], [13], [17], [20], [21], [28] …

Conclusion and future work

In this paper, a new action classification algorithm is proposed based on sparse coding with a locality constraint. The proposed algorithm captures discriminative information for human action classification from the spatio-temporal subsequences of action videos. It can be applied to both colour and depth videos. The proposed algorithm has been evaluated on five publicly available action datasets, including three depth and two colour datasets. Its performance has been compared with …

Acknowledgement

This research was supported by ARC grant DP110102399.

References (42)

  • D. Weinland et al.

    A survey of vision-based methods for action representation, segmentation and recognition

    Comput. Vis. Image Underst.

    (2011)
  • J. Aggarwal et al.

    Human activity analysis: A review

    ACM Comput. Surv.

    (2011)
  • M. Blank et al.

    Actions as space-time shapes

    Proceedings of ICCV

    (2005)
  • A. Bobick et al.

    The recognition of human movement using temporal templates

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2001)
  • K. Chatfield et al.

    The devil is in the details: an evaluation of recent feature encoding methods

    Proceedings of BMVC

    (2010)
  • P. Dollár et al.

    Behavior recognition via sparse spatio-temporal features

    Proceedings of ICCV

    (2005)
  • I. Everts et al.

    Evaluation of color STIPs for human action recognition

    Proceedings of CVPR

    (2013)
  • R.E. Fan et al.

    LIBLINEAR: a library for large linear classification

    J. Mach. Learn. Res.

    (2008)
  • H. Jégou et al.

    Aggregating local image descriptors into compact codes

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2012)
  • S.S. Keerthi et al.

    A sequential dual method for large scale multi-class linear SVMs

    Proceedings of ACM SIGKDD

    (2008)
  • C. Keskin et al.

    Real time hand pose estimation using depth sensors

    Proceedings of ICCVW

    (2011)
  • A. Kläser et al.

    A spatio-temporal descriptor based on 3D-gradients

    Proceedings of BMVC

    (2008)
  • A. Kurakin et al.

    A real time system for dynamic hand gesture recognition with a depth sensor

    Proceedings of EUSIPCO

    (2012)
  • T. Lan et al.

    Discriminative figure-centric models for joint action localization and recognition

    Proceedings of ICCV

    (2011)
  • I. Laptev

    On space-time interest points

    Int. J. Comput. Vis.

    (2005)
  • W. Li et al.

    Action recognition based on a bag of 3D points

    Proceedings of IEEE International Workshop on CVPR for Human Communicative Behavior Analysis (CVPR4HB)

    (2010)
  • D. Oneata et al.

    Action and event recognition with Fisher vectors on a compact feature set

    (2013)
  • O. Oreifej et al.

    HON4D: histogram of oriented 4D normals for activity recognition from depth sequences

    Proceedings of CVPR

    (2013)
  • H. Rahmani et al.

    Action classification with locality-constrained linear coding

    Proceedings of ICPR

    (2014)
  • H. Rahmani et al.

    HOPC: Histogram of oriented principal components of 3D pointclouds for action recognition

    Proceedings of ECCV, Part II

    (2014)
  • H. Rahmani et al.

    Real time action recognition using histograms of depth gradients and random decision forests

    Proceedings of WACV

    (2014)
This paper has been recommended for acceptance by Anders Heyden, Ph.D.