Discriminative human action classification using locality-constrained linear coding☆
Introduction
Human action classification from videos is an important problem because of its many applications in video surveillance, human-computer interaction, sports analysis, and elderly health care [1], [33]. However, automatic classification of human actions in videos is a challenging problem. For colour videos, lighting conditions in the environment and the clothing worn by the human subject can both affect the performance of action classification algorithms. While these problems are eliminated in depth videos, issues such as occlusion, loose clothing, and variations in the style and execution speed of actions remain.
We propose a human action classification algorithm based on Locality-constrained Linear Coding (LLC). We demonstrate improved action classification performance for actions captured by both depth and colour videos (Fig. 1). The study in [32] compared LLC with sparse coding (SC) and found that LLC retains more of the essential information than SC for object classification. In this paper, and in our earlier work in this direction [18], we investigate the use of LLC for encoding human actions. We found that, in order to favour sparsity, SC tends to select quite different elements from the action feature dictionary even for the same action. This has an adverse effect, as it increases the intra-action-class variation. LLC, on the other hand, imposes a locality constraint on the features and tends to select the same dictionary elements for the same action (a minimal coding sketch is given after the list below). Our research contributions are fourfold:
- We propose using locality-constrained linear coding for human action classification.
- We propose a sequence descriptor for each action video that is constructed in a hierarchical fashion, comprising the computation of cell descriptors, block descriptors, and subsequence descriptors.
- We propose using maximum pooling and logistic regression classification with ℓ2 regularization for action classification.
- We demonstrate the effectiveness of the proposed algorithm over existing techniques through improved action classification accuracy.
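The approximated LLC coding step described in [32] has a simple closed-form solution: each local descriptor is reconstructed from its k nearest codewords by solving a small regularised least-squares system with a sum-to-one constraint. The Python/NumPy sketch below only illustrates this coding step and is not the authors' implementation; the parameter names k and beta and their default values are our own choices for the example.

```python
import numpy as np

def llc_encode(X, B, k=5, beta=1e-4):
    """Approximate locality-constrained linear coding (after [32]).

    X : (n, d) array of local feature descriptors.
    B : (m, d) dictionary of m codewords.
    Returns an (n, m) code matrix; each row has k non-zero entries on
    the k nearest codewords and sums to one.
    """
    n = X.shape[0]
    m = B.shape[0]
    codes = np.zeros((n, m))
    # squared Euclidean distances between every descriptor and every codeword
    dists = ((X[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    for i in range(n):
        idx = np.argsort(dists[i])[:k]        # locality: keep the k nearest codewords
        z = B[idx] - X[i]                     # shift the local codewords to the origin
        C = z @ z.T                           # k x k local covariance
        C += beta * np.trace(C) * np.eye(k)   # small regulariser for numerical stability
        w = np.linalg.solve(C, np.ones(k))    # solve C w = 1
        codes[i, idx] = w / w.sum()           # enforce the sum-to-one constraint
    return codes
```

Because the non-zero coefficients are restricted to the nearest codewords, descriptors extracted from repetitions of the same action tend to activate the same dictionary entries, which is the locality property exploited here to reduce intra-class variation.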
To show that LLC is effective for human action classification, we design our sequence descriptor based on LLC and compare its human action classification accuracy with that of the descriptor computed using SC. Furthermore, we evaluate these sequence descriptors against ten state-of-the-art techniques. For the benchmark depth video datasets (MSRGesture3D, MSRAction3D, and MSRActionPairs3D), we compare our algorithm against the algorithms of
- [11], where the local descriptor is based on the histogram of oriented 3D spatio-temporal gradients (or HOG3D);
- [30], where the random occupancy patterns (ROP) are used;
- [31], where an actionlet ensemble model is learned;
- [17], where the histogram of oriented 4D normals (HON4D) features are used;
- [35], where space time interest points from depth sequences (DSTIP) are used;
- [20], where local features encoding variations of depths and depth gradients plus skeletal body joints are used with a random decision forest (RDF) classifier.
For the benchmark colour video datasets (Weizmann and UCFSports), we compare our algorithm with the algorithms of [11] and the following:
- [38], where the mapping between densely-sampled feature patches and the votes in a spatio-temporal action Hough space is trained using random trees;
- [42], where the HOG3D feature is computed within a small 3D cuboid centred at a space-time point and encoded in a sparse coding framework;
- [13], where a figure-centric word representation is used;
- [21], where spatio-temporal structures obtained from clustering of point trajectories of body parts are used;
- [28], where a deformable part model is generated for each action from a collection of examples, with actions being treated as spatio-temporal patterns in the colour videos.
In all of these experiments, our proposed algorithm has consistently shown improved performance compared to the existing algorithms.
Related work
Many approaches to human action classification involve analyzing the colour videos directly [5], [6], [28], [40]. Since the release of the Kinect camera in 2011, an increasing number of human action classification papers targeting depth videos, colour+depth videos, and/or skeletal data have emerged [10], [12], [15], [17], [19], [20], [30], [31], [35], [37].
Some action classification algorithms exploited silhouette and edge pixels to form discriminative
Proposed algorithm
We define an action as a function operating on a three-dimensional space. The three independent variables in this space are (x, y, t) and the dependent variable is the depth d, i.e., d = f(x, y, t). In this setting, each action is characterized by the variations of the depth values along the three dimensions.
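The snippet above is truncated, but together with the contributions listed in the Introduction (hierarchical cell/block/subsequence descriptors, maximum pooling, and ℓ2-regularised logistic regression), the overall flow can be sketched as follows. This is an illustrative reconstruction rather than the authors' implementation: the depth-gradient cell feature, the grid sizes, the collapsing of the block and subsequence levels into a single pooling step, and the use of scikit-learn's LogisticRegression are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def cell_descriptor(cell):
    """Toy cell feature: histograms of the depth gradients along t, y and x."""
    gt, gy, gx = np.gradient(cell.astype(float))   # variations of depth along the 3 dimensions
    feats = [np.histogram(g, bins=8, range=(-50, 50))[0] for g in (gx, gy, gt)]
    return np.concatenate(feats)

def video_descriptor(video, dictionary, grid=(4, 4, 4), k=5):
    """LLC-encode the cell features of a depth video and max-pool the codes.

    video      : (T, H, W) depth sequence, i.e. d = f(x, y, t) sampled on a grid.
    dictionary : (m, d) codebook learned from training cell features.
    """
    T, H, W = video.shape
    ct, cy, cx = grid
    feats = []
    for ti in range(ct):
        for yi in range(cy):
            for xi in range(cx):
                cell = video[ti * T // ct:(ti + 1) * T // ct,
                             yi * H // cy:(yi + 1) * H // cy,
                             xi * W // cx:(xi + 1) * W // cx]
                feats.append(cell_descriptor(cell))
    codes = llc_encode(np.vstack(feats), dictionary, k=k)   # llc_encode from the earlier sketch
    return codes.max(axis=0)                                # maximum pooling over the video

# Classification with an l2-regularised logistic regression, as stated in the contributions.
# X_train / X_test would be stacked video descriptors, y_train the action labels.
clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
# clf.fit(X_train, y_train); predictions = clf.predict(X_test)
```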
Experiments and results
We evaluate the action classification performance of the proposed algorithm on actions captured by both depth videos and colour (RGB) videos. For depth videos, we use three public datasets: MSRAction3D [15], [31], MSRGesture3D [12], and MSRActionPairs3D [17]. For colour data, we use two public datasets: Weizmann [2] and UCFSports [22].
The performance of our proposed algorithm is compared with ten state-of-the-art algorithms including [11], [13], [17], [20], [21], [28]
Conclusion and future work
In this paper, a new action classification algorithm is proposed based on sparse coding with a locality constraint. The proposed algorithm captures discriminative information for human action classification using the spatio-temporal sequences of the action videos. It can be applied to both colour and depth videos. The proposed algorithm has been evaluated on five publicly available action datasets, including three depth and two colour datasets. Its performance has been compared with
Acknowledgement
This research was supported by ARC grant DP110102399.
References (42)
- et al., A survey of vision-based methods for action representation, segmentation and recognition, Comput. Vis. Image Underst. (2011)
- et al., Human activity analysis: a review, ACM Comput. Surv. (2011)
- et al., Actions as space-time shapes, Proceedings of ICCV (2005)
- et al., The recognition of human movement using temporal templates, IEEE Trans. Pattern Anal. Mach. Intell. (2001)
- et al., The devil is in the details: an evaluation of recent feature encoding methods, Proceedings of BMVC (2010)
- et al., Behavior recognition via sparse spatio-temporal features, Proceedings of ICCV (2005)
- et al., Evaluation of color STIPs for human action recognition, Proceedings of CVPR (2013)
- et al., LIBLINEAR: a library for large linear classification, J. Mach. Learn. Res. (2008)
- et al., Aggregating local image descriptors into compact codes, IEEE Trans. Pattern Anal. Mach. Intell. (2012)
- et al., A sequential dual method for large scale multi-class linear SVMs, Proceedings of ACM SIGKDD (2008)
- Real time hand pose estimation using depth sensors, Proceedings of ICCVW
- A spatio-temporal descriptor based on 3D-gradients, Proceedings of BMVC
- A real time system for dynamic hand gesture recognition with a depth sensor, Proceedings of EUSIPCO
- Discriminative figure-centric models for joint action localization and recognition, Proceedings of ICCV
- On space-time interest points, Int. J. Comput. Vis.
- Action recognition based on a bag of 3D points, Proceedings of IEEE International Workshop on CVPR for Human Communicative Behavior Analysis (CVPR4HB)
- HON4D: histogram of oriented 4D normals for activity recognition from depth sequences, Proceedings of CVPR
- Action classification with locality-constrained linear coding, Proceedings of ICPR
- HOPC: histogram of oriented principal components of 3D pointclouds for action recognition, Proceedings of ECCV, Part II
- Real time action recognition using histograms of depth gradients and random decision forests, Proceedings of WACV
☆ This paper has been recommended for acceptance by Anders Heyden, Ph.D.