Discriminative human action classification using locality-constrained linear coding☆
Introduction
Human action classification from videos is an important problem because of its many applications in video surveillance, human-computer interaction, sports analysis, and elderly health care [1], [33]. However, automatic classification of human actions in videos is a challenging problem. For colour videos, lighting conditions in the environment and the clothing worn by the human subject can both affect the performance of action classification algorithms. While these problems are eliminated in depth videos, issues such as occlusion, loose clothing, and variations in the style and execution speed of actions remain.
We propose a human action classification algorithm based on Locality-constrained Linear Coding (LLC). We demonstrate improved action classification performance for actions captured by both depth and colour videos (Fig. 1). The study in [32] compared LLC with sparse coding (SC) and found that LLC retains more of the essential information than SC for object classification. In this paper, and in our earlier work in this direction [18], we investigate the use of LLC for encoding human actions. We found that, in order to favour sparsity, SC tends to select quite different elements from the action feature dictionary even for the same action. This has an adverse effect, as it increases the intra-action-class variation. LLC, on the other hand, imposes a locality constraint on the features and tends to select the same dictionary elements for the same action (a minimal coding sketch is given after the list below). Our research contributions are fourfold:
- We propose using locality-constrained linear coding for human action classification.
- We propose a sequence descriptor for each action video that is constructed in a hierarchical fashion, comprising the computation of cell descriptors, block descriptors, and subsequence descriptors.
- We propose using maximum pooling and logistic regression classification with ℓ2 regularization for action classification.
- We demonstrate the effectiveness of the proposed algorithm over existing techniques through improved action classification accuracy.
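The approximated LLC coding step described in [32] has a simple closed-form solution: each local descriptor is reconstructed from its k nearest codewords by solving a small regularised least-squares system with a sum-to-one constraint. The Python/NumPy sketch below only illustrates this coding step and is not the authors' implementation; the parameter names k and beta and their default values are our own choices for the example.

```python
import numpy as np

def llc_encode(X, B, k=5, beta=1e-4):
    """Approximate locality-constrained linear coding (after [32]).

    X : (n, d) array of local feature descriptors.
    B : (m, d) dictionary of m codewords.
    Returns an (n, m) code matrix; each row has k non-zero entries on
    the k nearest codewords and sums to one.
    """
    n = X.shape[0]
    m = B.shape[0]
    codes = np.zeros((n, m))
    # squared Euclidean distances between every descriptor and every codeword
    dists = ((X[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    for i in range(n):
        idx = np.argsort(dists[i])[:k]        # locality: keep the k nearest codewords
        z = B[idx] - X[i]                     # shift the local codewords to the origin
        C = z @ z.T                           # k x k local covariance
        C += beta * np.trace(C) * np.eye(k)   # small regulariser for numerical stability
        w = np.linalg.solve(C, np.ones(k))    # solve C w = 1
        codes[i, idx] = w / w.sum()           # enforce the sum-to-one constraint
    return codes
```

Because the non-zero coefficients are restricted to the nearest codewords, descriptors extracted from repetitions of the same action tend to activate the same dictionary entries, which is the locality property exploited here to reduce intra-class variation.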
To show that LLC is effective for human action classification, we design our sequence descriptor based on LLC and compare its human action classification accuracy with that of the descriptor computed using SC. Furthermore, we evaluate these sequence descriptors against ten state-of-the-art techniques. For the benchmark depth video datasets (MSRGesture3D, MSRAction3D, and MSRActionPairs3D), we compare our algorithm against the algorithms of
- [11], where the local descriptor is based on the histogram of oriented 3D spatio-temporal gradients (or HOG3D);
- [30], where the random occupancy patterns (ROP) are used;
- [31], where an actionlet ensemble model is learned;
- [17], where the histogram of oriented 4D normals (HON4D) features are used;
- [35], where space time interest points from depth sequences (DSTIP) are used;
- [20], where local features encoding variations of depths and depth gradients plus skeletal body joints are used with a random decision forest (RDF) classifier.
For the benchmark colour video datasets (Weizmann and UCFSports), we compare our algorithm with the algorithms of [11] and the following:
- [38], where the mapping between densely-sampled feature patches and the votes in a spatio-temporal action Hough space is trained using random trees;
- [42], where the HOG3D feature is computed within a small 3D cuboid centred at a space-time point and encoded in a sparse coding framework;
- [13], where a figure-centric word representation is used;
- [21], where spatio-temporal structures obtained from clustering of point trajectories of body parts are used;
- [28], where a deformable part model is generated for each action from a collection of examples, with actions being treated as spatio-temporal patterns in the colour videos.
In all of these experiments, our proposed algorithm has consistently shown improved performance compared to the existing algorithms.
Related work
Many approaches to human action classification involve analyzing the colour videos directly [5], [6], [28], [40]. Since the release of the Kinect camera in 2011, an increasing number of human action classification papers targeting depth videos, colour+depth videos, and/or skeletal data have emerged [10], [12], [15], [17], [19], [20], [30], [31], [35], [37].
Some action classification algorithms exploited silhouette and edge pixels to form discriminative
Proposed algorithm
We define an action as a function operating on a three-dimensional space. The three independent variables in this space are (x, y, t) and the dependent variable is the depth d, i.e., d = f(x, y, t). In this setting, each action is characterized by the variations of the depth values along the three dimensions.
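The snippet above is truncated, but together with the contributions listed in the Introduction (hierarchical cell/block/subsequence descriptors, maximum pooling, and ℓ2-regularised logistic regression), the overall flow can be sketched as follows. This is an illustrative reconstruction rather than the authors' implementation: the depth-gradient cell feature, the grid sizes, the collapsing of the block and subsequence levels into a single pooling step, and the use of scikit-learn's LogisticRegression are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def cell_descriptor(cell):
    """Toy cell feature: histograms of the depth gradients along t, y and x."""
    gt, gy, gx = np.gradient(cell.astype(float))   # variations of depth along the 3 dimensions
    feats = [np.histogram(g, bins=8, range=(-50, 50))[0] for g in (gx, gy, gt)]
    return np.concatenate(feats)

def video_descriptor(video, dictionary, grid=(4, 4, 4), k=5):
    """LLC-encode the cell features of a depth video and max-pool the codes.

    video      : (T, H, W) depth sequence, i.e. d = f(x, y, t) sampled on a grid.
    dictionary : (m, d) codebook learned from training cell features.
    """
    T, H, W = video.shape
    ct, cy, cx = grid
    feats = []
    for ti in range(ct):
        for yi in range(cy):
            for xi in range(cx):
                cell = video[ti * T // ct:(ti + 1) * T // ct,
                             yi * H // cy:(yi + 1) * H // cy,
                             xi * W // cx:(xi + 1) * W // cx]
                feats.append(cell_descriptor(cell))
    codes = llc_encode(np.vstack(feats), dictionary, k=k)   # llc_encode from the earlier sketch
    return codes.max(axis=0)                                # maximum pooling over the video

# Classification with an l2-regularised logistic regression, as stated in the contributions.
# X_train / X_test would be stacked video descriptors, y_train the action labels.
clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
# clf.fit(X_train, y_train); predictions = clf.predict(X_test)
```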
Experiments and results
We evaluate the action classification performance of the proposed algorithm on actions captured by both depth videos and colour (RGB) videos. For depth videos, we use three public datasets: MSRAction3D [15], [31], MSRGesture3D [12], and MSRActionPairs3D [17]. For colour data, we use two public datasets: Weizmann [2] and UCFSports [22].
The performance of our proposed algorithm is compared with ten state-of-the-art algorithms including [11], [13], [17], [20], [21], [28]
Conclusion and future work
In this paper, a new action classification algorithm is proposed based on sparse coding with a locality constraint. The proposed algorithm captures discriminative information for human action classification using the spatio-temporal sequences of the action videos. It can be applied to both colour and depth videos. The proposed algorithm has been evaluated on five publicly available action datasets, including three depth and two colour datasets. Its performance has been compared with
Acknowledgement
This research was supported by ARC grant DP110102399.
References (42)
- et al., A survey of vision-based methods for action representation, segmentation and recognition, Comput. Vis. Image Underst. (2011)
- et al., Human activity analysis: a review, ACM Comput. Surv. (2011)
- et al., Actions as space-time shapes, Proceedings of ICCV (2005)
- et al., The recognition of human movement using temporal templates, IEEE Trans. Pattern Anal. Mach. Intell. (2001)
- et al., The devil is in the details: an evaluation of recent feature encoding methods, Proceedings of BMVC (2010)
- et al., Behavior recognition via sparse spatio-temporal features, Proceedings of ICCV (2005)
- et al., Evaluation of color STIPs for human action recognition, Proceedings of CVPR (2013)
- et al., LIBLINEAR: a library for large linear classification, J. Mach. Learn. Res. (2008)
- et al., Aggregating local image descriptors into compact codes, IEEE Trans. Pattern Anal. Mach. Intell. (2012)
- et al., A sequential dual method for large scale multi-class linear SVMs, Proceedings of ACM SIGKDD (2008)
- Real time hand pose estimation using depth sensors, Proceedings of ICCVW
- A spatio-temporal descriptor based on 3D-gradients, Proceedings of BMVC
- A real time system for dynamic hand gesture recognition with a depth sensor, Proceedings of EUSIPCO
- Discriminative figure-centric models for joint action localization and recognition, Proceedings of ICCV
- On space-time interest points, Int. J. Comput. Vis.
- Action recognition based on a bag of 3D points, Proceedings of IEEE International Workshop on CVPR for Human Communicative Behavior Analysis (CVPR4HB)
- HON4D: histogram of oriented 4D normals for activity recognition from depth sequences, Proceedings of CVPR
- Action classification with locality-constrained linear coding, Proceedings of ICPR
- HOPC: histogram of oriented principal components of 3D pointclouds for action recognition, Proceedings of ECCV, Part II
- Real time action recognition using histograms of depth gradients and random decision forests, Proceedings of WACV
☆ This paper has been recommended for acceptance by Anders Heyden, Ph.D.