Action recognition by spatio-temporal oriented energies
Introduction
Human action recognition [22], [2], [9] has been extensively researched in computer vision, with potential applications in many areas such as visual surveillance, video indexing/retrieval, sports event analysis and human-computer interaction. However, action recognition remains a challenging task, mainly because of large intra-class variations (i.e., the same action performed by different actors may differ significantly) and inter-class similarities (e.g., ‘running’ and ‘jogging’ appear rather similar). Existing action recognition systems mainly build on local or holistic representations.
Local representations using sparsely detected spatio-temporal interest points (STIPs) have dominated human action recognition over the last decade. The popularity of local methods, e.g., the bag-of-words (BoW) model, results from attractive advantages such as being less sensitive to partial occlusions and clutter, and requiring none of the background subtraction or target tracking steps used in most holistic methods. Nevertheless, local methods also suffer from some limitations, one of which is the inability to capture adequate spatial and temporal structure information of actions.
Holistic representations directly extract spatio-temporal features from raw video sequences rather than applying sparse sampling with STIP detectors. The advantage of such representations is that they preserve the entire spatial and temporal structure of the human action in a sequence. However, to suppress interference from background variations, accurate preprocessing steps such as background subtraction, segmentation and tracking are usually required.
In both local and holistic representations, low-level features play a fundamental role. These features appear in spatio-temporal volumes at arbitrary orientations and carry important cues of actions, e.g., appearance and motion. To detect such features, an oriented filter kernel would have to be applied at each possible orientation, which is computationally expensive. Oriented filters have often been used in image processing. Features based on oriented gradients have been widely used and successfully extended from the image domain to video analysis and action recognition [8], [27]. In the image domain, Freeman et al. [15] proposed steerable filters to efficiently synthesize filters of arbitrary orientations from linear combinations of basis filters.
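The steering property can be verified in a few lines. The following sketch (our illustration, not code from the paper) builds the two basis filters for a first derivative of a 2D Gaussian and checks that their linear combination reproduces the derivative at an arbitrary angle:

```python
import numpy as np

# First derivative of a 2D Gaussian along x and y (the two basis filters).
sigma = 2.0
ax = np.arange(-6, 7)
xx, yy = np.meshgrid(ax, ax)
g = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
gx = -xx / sigma**2 * g          # dG/dx
gy = -yy / sigma**2 * g          # dG/dy

def oriented_kernel(theta):
    """Steer the derivative filter to angle theta via the basis set."""
    return np.cos(theta) * gx + np.sin(theta) * gy

# The analytically rotated kernel matches the steered combination.
theta = np.deg2rad(30)
u = xx * np.cos(theta) + yy * np.sin(theta)   # coordinate along theta
direct = -u / sigma**2 * g                    # derivative along theta
steered = oriented_kernel(theta)
print(np.allclose(direct, steered))  # True
```

Because only two fixed basis filters are ever convolved with the image, responses at any number of orientations come almost for free.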
Adelson and Bergen [1] introduced a class of models for analysis of human motion mechanisms in which the first stage consists of linear filters that are oriented in space–time and tuned in spatial frequency. The outputs of the quadrature pairs of such filters are squared and summed to give a measure of motion energy. Energy models can be built from elements that are consistent with known physiology and psychophysics, and they permit a qualitative understanding of a variety of motion phenomena.
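As a toy illustration of the energy model (our own 1D sketch, not Adelson and Bergen's implementation), the example below forms a quadrature pair of Gabor filters and shows that the squared, summed outputs are insensitive to the phase of the input signal:

```python
import numpy as np

# Quadrature pair: even (cosine) and odd (sine) Gabor filters at one frequency.
sigma, f = 4.0, 0.1
t = np.arange(-12, 13)
env = np.exp(-t**2 / (2 * sigma**2))
even = env * np.cos(2 * np.pi * f * t)
odd  = env * np.sin(2 * np.pi * f * t)

def energy(signal):
    """Local energy: quadrature outputs squared and summed."""
    e = np.convolve(signal, even, mode='valid')
    o = np.convolve(signal, odd,  mode='valid')
    return e**2 + o**2

x = np.arange(200)
e0 = energy(np.sin(2 * np.pi * f * x))              # one phase
e1 = energy(np.sin(2 * np.pi * f * x + np.pi / 3))  # shifted phase
# The energy is (near-)invariant to phase, unlike the raw filter outputs.
print(np.allclose(e0.mean(), e1.mean(), rtol=1e-2))  # True
```

This phase invariance is what makes oriented energy a stable measure of "how much" structure is present at an orientation, rather than "where exactly" it is.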
Extracting oriented spatio-temporal features based on steerable filters for video analysis has been well researched in previous works [49], [10], [12], [11], [6]. The use of steerable filters for spatio-temporal data analysis dates back to the work by Wildes and Bergen [49]. They provided an avenue to perform qualitative analysis of spatio-temporal patterns that capture the underlying salient structures in video sequences. Local energy representations based on the quadrature outputs of the steerable filters were also used in their work, which is regarded as the foundation for analyzing spatio-temporal data with steerable filters.
By extending two-dimensional steerable filters into three dimensions, Derpanis and Gryn [10] detailed the construction of Nth-derivative-of-Gaussian separable steerable filters in three-dimensional space. This separable, steerable implementation allows the filters to be computed efficiently.
In light of this previous work, local oriented energy representations have been applied to spatio-temporal grouping [12], efficient action spotting [11] and visual tracking [6]. Derpanis and Wildes [12] adopted the oriented energy representation to group raw image data into a set of coherent spatio-temporal regions. This representation describes the presence of particular oriented spatio-temporal structures in a distributed manner so as to capture multiple oriented structures at a given location. They further designed a descriptor based on the oriented energy measurements for action spotting [11]. Slightly different from [12], the local energies in [11] are calculated from a third-order Gaussian derivative rather than from the quadrature outputs of the steerable filters.
In the same spirit, Cannons et al. [6] developed a pixel-wise spatio-temporal oriented energy representation for visual tracking. In contrast to [12], [11], they used a multi-scale Gaussian steerable filter. The representation encodes appearance and motion information as well as how these cues are spatially arranged.
Our work is motivated by the fact that a video sequence with motion can be represented as a single pattern in X–Y–T space, in which a motion velocity corresponds to a three-dimensional orientation. Motion information can thus be extracted by a system that responds to oriented spatio-temporal energy. In addition, spatio-temporal features reside at different scales and can be extracted by multi-scale analysis. Steerable filters efficiently perform multi-orientation analysis of videos, while the Laplacian pyramid provides effective multi-scale analysis. By combining the two, the proposed spatio-temporal steerable pyramid (STSP) detects non-orthogonal, over-complete features and exhibits the desirable properties of shift and rotation invariance. It is a transform that combines multi-scale decomposition with differential measurements, capturing oriented structures in spatio-temporal volumes.
Inspired by the success of steerable filters in object classification [32] and video analysis [49], we introduce a novel holistic representation based on the spatio-temporal steerable pyramid (STSP) for action recognition. In contrast to previous holistic methods, our STSP-based method largely overcomes the deficits of holistic representations and provides an informative, compact representation of human actions.
Note that this paper is an extension of the work in [54]. In the current version, we generalize the STSP by extending it from intensity to gradients and optical flow, and conduct comprehensive experiments investigating parameter settings on more datasets.
Given a 3D volume, which in our case can be the intensity volume, optical flow or 3D gradients of a video sequence, a spatio-temporal Laplacian pyramid is first constructed. The volume is decomposed into a set of sub-band volumes, which segregate and enhance spatio-temporal features residing at different scales.
To efficiently explore oriented patterns in video sequences, a bank of spatio-temporal steerable filters at different scales is then applied to each level of the obtained Laplacian pyramid. The filters are separable and steerable in three dimensions (X–Y–T) and can therefore be computed efficiently.
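Separability is what makes this affordable: a 3D derivative-of-Gaussian kernel factors into three 1D kernels, so filtering a volume costs three cheap 1D passes instead of one full 3D convolution. A quick numerical check (our illustration, with assumed kernel sizes):

```python
import numpy as np

# A first derivative of a 3D Gaussian factors into three 1D kernels.
sigma = 1.5
t = np.arange(-4, 5)
g  = np.exp(-t**2 / (2 * sigma**2))
dg = -t / sigma**2 * g                     # derivative along one axis

# Full 3D kernel built as the outer product of its 1D factors.
full = np.einsum('i,j,k->ijk', dg, g, g)   # d/dx G(x, y, t)

# Along the central y = t = 0 line the kernel reduces to the 1D derivative.
print(full.shape, np.allclose(full[:, 4, 4], dg))
# Per-voxel cost: 3 * 9 multiplies for the separable form vs. 9**3 for the
# full kernel.
```

The same factorization holds for each basis filter of the steerable set, so the entire bank is applied as a short sequence of 1D convolutions per axis.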
Motivated by the previous work, we employ a representation based on spatio-temporal local energies, calculated from the quadrature pairs of filter responses at each voxel of every sub-band volume.
Finally, a feature pooling operation, i.e., max pooling, is performed between adjacent scales of the steerable filters and over local spatio-temporal neighbourhoods, which makes the final representation more robust to scaling and shifts. In addition, the features become more compact after max pooling. The flowchart of feature extraction is illustrated in Fig. 1.
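A minimal sketch of the pooling step (our illustration; the cell size and the element-wise max across adjacent scales are assumed parameters, not the paper's settings):

```python
import numpy as np

def max_pool3d(vol, cell=(4, 4, 2)):
    """Max pooling over non-overlapping spatio-temporal cells."""
    cx, cy, ct = cell
    X, Y, T = vol.shape
    v = vol[:X - X % cx, :Y - Y % cy, :T - T % ct]   # crop to a multiple
    v = v.reshape(v.shape[0] // cx, cx,
                  v.shape[1] // cy, cy,
                  v.shape[2] // ct, ct)
    return v.max(axis=(1, 3, 5))

# Pool energy volumes from two adjacent filter scales, then over local cells.
e_scale1 = np.random.rand(16, 16, 8)
e_scale2 = np.random.rand(16, 16, 8)
pooled = max_pool3d(np.maximum(e_scale1, e_scale2))
print(pooled.shape)  # (4, 4, 4)
```

Taking the maximum keeps the strongest response in each cell, so small displacements or scale changes of a feature leave the pooled value unchanged.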
The contributions of the proposed method can be summarized as follows: (1) a new model based on the spatio-temporal steerable pyramid is proposed for action recognition; (2) local oriented energies are employed, for the first time, as spatio-temporal features for the holistic representation of human actions; (3) a spatio-temporal max pooling operation is incorporated into the spatio-temporal steerable pyramid model, leading to a more robust and compact representation.
The rest of the paper is organized as follows: Section 2 reviews related work, Section 3 describes the details of our method, Section 4 presents experimental results, and Section 5 concludes the paper.
Related work
Since the pioneering works in [40], [13], [26], local features, i.e., spatio-temporal interest points (STIPs), in conjunction with the bag-of-words (BoW) model, have become popular for the local representation of human actions. Effective local feature detectors and descriptors have been designed and serve as the basis for local methods.
Local representation is less sensitive to partial occlusions and clutter and can avoid some preliminary steps used in holistic methods, such as background subtraction and tracking.
Feature extraction
The theories of multi-scale representation and orientation analysis have been widely researched and used for image and video analysis. We instantiate them by combining the Laplacian pyramid and steerable filters, and propose the spatio-temporal steerable pyramid (STSP) framework for the holistic representation of human actions.
In order to fully exploit the information residing in a video sequence with actions occurring in it, we apply spatio-temporal steerable filters on each level of a spatio-temporal Laplacian pyramid.
Experiments and results
We evaluate the proposed method, i.e., STSP, on the KTH, UCF Sports and newly released HMDB51 datasets. Sample frames from the three datasets are illustrated in Fig. 5. A linear support vector machine (SVM) is used for action classification [7].
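For context, the classification stage can be sketched as follows (toy data standing in for pooled STSP descriptors; scikit-learn's LinearSVC is used here as a stand-in for the LIBSVM-family implementation cited in [7]):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy stand-in for STSP descriptors: two well-separated "action" classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (40, 64)),   # class 0 descriptors
               rng.normal(3.0, 1.0, (40, 64))])  # class 1 descriptors
y = np.repeat([0, 1], 40)

clf = LinearSVC(C=1.0).fit(X, y)   # linear SVM classifier
acc = clf.score(X, y)
print(acc)
```

With a compact holistic descriptor per video, a linear kernel is usually sufficient and keeps training and testing fast.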
Conclusion
In this paper, we have introduced an efficient holistic representation, named the spatio-temporal steerable pyramid (STSP), for human action recognition. By decomposing a video sequence with a Laplacian pyramid, spatio-temporal salient features of various sizes can be well localized and enhanced. Multi-scale steerable filters efficiently extract features at multiple scales and orientations. The spatio-temporal max pooling operation makes the features more compact while keeping them invariant and robust.
Acknowledgements
The authors acknowledge the support of the University of Sheffield, the China Scholarship Council (CSC), the National Natural Science Foundation of China (Grant No: 61125106), and Shaanxi Key Innovation Team of Science and Technology (Grant No: 2012KCT-04).
References (55)
- Human action recognition using shape and CLG-motion flow from multi-view image sequences, Pattern Recogn. (2008)
- Content-based retrieval of human actions from realistic video databases, Inform. Sci. (2013)
- Geometric and photometric invariant distinctive regions detection, Inform. Sci. (2007)
- Spatiotemporal energy models for the perception of motion, J. Opt. Soc. Am. (1985)
- The recognition of human movement using temporal templates, IEEE Trans. Pattern Anal. Mach. Intell. (2002)
- Y. Boureau, J. Ponce, Y. LeCun, A theoretical analysis of feature pooling in visual recognition, in: International...
- The Laplacian pyramid as a compact image code, IEEE Trans. Commun. (1983)
- K. Cannons, J. Gryn, R. Wildes, Visual tracking using a pixelwise spatiotemporal oriented energy representation, in:...
- LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol. (2011)
- N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: IEEE Conference on Computer Vision and...
- LF-EME: local features with elastic manifold embedding for human action recognition, Neurocomputing
- The design and use of steerable filters, IEEE Trans. Pattern Anal. Mach. Intell.
- Actions as space-time shapes, IEEE Trans. Pattern Anal. Mach. Intell.
- A fast learning algorithm for deep belief nets, Neural Comput.
- A model of saliency-based visual attention for rapid scene analysis, IEEE Trans. Pattern Anal. Mach. Intell.
- Trajectory-based modeling of human actions with motion reference points
- Motion interchange patterns for action recognition in unconstrained videos