Information Sciences

Volume 281, 10 October 2014, Pages 295-309

Action recognition by spatio-temporal oriented energies

https://doi.org/10.1016/j.ins.2014.05.021

Abstract

In this paper, we present a unified representation based on the spatio-temporal steerable pyramid (STSP) for the holistic representation of human actions. A video sequence is viewed as a spatio-temporal volume that preserves all the appearance and motion information of the action in it. By decomposing spatio-temporal volumes into band-passed sub-volumes, the spatio-temporal Laplacian pyramid provides an effective technique for multi-scale analysis of video sequences, so that spatio-temporal patterns at different scales can be well localized and captured. To efficiently explore the underlying local spatio-temporal orientation structures at multiple scales, a bank of three-dimensional separable steerable filters is applied to each sub-volume of the Laplacian pyramid. The outputs of the quadrature pairs of steerable filters are squared and summed to yield a more robust oriented energy representation. To achieve further invariance and compactness, a spatio-temporal max pooling operation is performed between the filtering responses at adjacent scales and over spatio-temporal neighbourhoods. To capture the appearance, local geometric structure and motion of an action, we apply the STSP to the intensity, 3D gradients and optical flow of video sequences, yielding a unified holistic representation of human actions.

Taking advantage of multi-scale, multi-orientation analysis and feature pooling, the STSP produces a compact yet informative and invariant representation of human actions. We conduct extensive experiments on the KTH, UCF Sports and HMDB51 datasets, which show that the unified STSP achieves results comparable to those of state-of-the-art methods.

Introduction

Human action recognition [22], [2], [9] has been extensively researched in computer vision. Its potential applications can be found in many areas, such as visual surveillance, video indexing/retrieval, sports event analysis and human-computer interaction. However, action recognition is a challenging task, mainly due to large intra-class variations (i.e., the same action performed by different actors can differ significantly) and inter-class similarities (e.g., ‘running’ and ‘jogging’ appear rather similar). Existing action recognition systems mainly focus on local and holistic representations.

Local representations using sparsely detected spatio-temporal interest points (STIPs) have dominated human action recognition in the last decade. The popularity of local methods, e.g., the bag-of-words (BoW) model, results from attractive advantages such as being less sensitive to partial occlusions and clutter and requiring no background subtraction or target tracking, on which most holistic methods rely. Nevertheless, local methods also suffer from some limitations, one of which is the inability to capture adequate spatial and temporal structure information of actions.

Holistic representations directly extract spatio-temporal features from raw video sequences rather than applying sparse sampling with STIP detectors. The advantage of such representations is that they retain the entire spatial and temporal structural information of the human action in a sequence. However, to prevent interference from background variations, accurate preprocessing steps such as background subtraction, segmentation and tracking are usually required.

In both local and holistic representations, low-level features play a fundamental role in representing human actions. These features appear in spatio-temporal volumes at arbitrary orientations and carry important characteristics of actions, e.g., appearance and motion. To detect such features, a set of oriented filter kernels would have to be applied at every possible orientation, which is computationally expensive. Oriented filters have long been used in image processing, and features based on oriented gradients have been widely used and successfully extended from the image domain to video analysis and action recognition [8], [27]. In the image domain, Freeman et al. [15] proposed steerable filters, which efficiently synthesize filters of arbitrary orientations from linear combinations of basis filters.
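To make the steering property concrete, the following minimal sketch illustrates the two-dimensional case: the first derivative of a Gaussian at an arbitrary orientation θ is an exact linear combination of two basis filters. The function names and parameter values are illustrative only, not from the original work.

```python
import numpy as np

def gaussian_derivative_basis(size=9, sigma=1.5):
    """Basis filters: first derivatives of a 2D Gaussian along x and y."""
    r = np.arange(size) - size // 2
    x, y = np.meshgrid(r, r)
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return -x / sigma**2 * g, -y / sigma**2 * g  # G_x, G_y

def steer(gx, gy, theta):
    """Synthesize the derivative filter oriented at angle theta as a
    linear combination of the two basis filters."""
    return np.cos(theta) * gx + np.sin(theta) * gy

gx, gy = gaussian_derivative_basis()
g45 = steer(gx, gy, np.pi / 4)  # filter at 45 degrees, no re-filtering needed
```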

Adelson and Bergen [1] introduced a class of models of human motion perception in which the first stage consists of linear filters that are oriented in space–time and tuned in spatial frequency. The outputs of quadrature pairs of such filters are squared and summed to give a measure of motion energy. Energy models can be built from elements that are consistent with known physiology and psychophysics, and they permit a qualitative understanding of a variety of motion phenomena.
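The energy measure is straightforward to express in code. The sketch below is a toy one-dimensional illustration, not the model of [1] itself: a signal is filtered with an even/odd Gabor quadrature pair, and the squared responses are summed, yielding an energy that is insensitive to the phase of the input.

```python
import numpy as np

def gabor_quadrature_pair(size=21, sigma=3.0, freq=0.25):
    """Even (cosine) and odd (sine) Gabor kernels: identical amplitude
    spectra, 90-degree phase offset, i.e. a quadrature pair."""
    t = np.arange(size) - size // 2
    env = np.exp(-t**2 / (2 * sigma**2))
    return env * np.cos(2 * np.pi * freq * t), env * np.sin(2 * np.pi * freq * t)

even, odd = gabor_quadrature_pair()
signal = np.sin(2 * np.pi * 0.25 * np.arange(200))  # toy input

e = np.convolve(signal, even, mode='same')
o = np.convolve(signal, odd, mode='same')
energy = e**2 + o**2  # squared and summed: phase-insensitive energy
```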

Extracting oriented spatio-temporal features based on steerable filters for video analysis has been well researched in previous works [49], [10], [12], [11], [6]. The use of steerable filters for spatio-temporal data analysis dates back to the work of Wildes and Bergen [49], who provided an avenue for qualitative analysis of spatio-temporal patterns that captures the underlying salient structures in video sequences. Local energy representations based on the quadrature outputs of steerable filters were also used in their work, which is deemed the foundation for analyzing spatio-temporal data with steerable filters.

By extending two-dimensional steerable filters into three dimensions, Derpanis and Gryn [10] detailed the construction of Nth-derivative-of-Gaussian separable steerable filters in three-dimensional space. The separable, steerable implementation leads to efficient computation of the filter responses.
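Separability is what makes the three-dimensional filtering tractable: each basis filter factors into three one-dimensional passes. As an illustrative sketch (using SciPy's gaussian_filter rather than the exact basis set of [10]), second-derivative-of-Gaussian responses of a spatio-temporal volume can be computed as follows.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Toy spatio-temporal volume laid out as (T, Y, X).
volume = np.random.rand(32, 64, 64).astype(np.float32)

# Separable implementation: differentiation along one axis combined with
# Gaussian smoothing along the others reduces the 3D convolution to three
# 1D passes. The order tuple selects the derivative order per axis.
g2_x = gaussian_filter(volume, sigma=1.5, order=(0, 0, 2))  # d2/dx2
g2_y = gaussian_filter(volume, sigma=1.5, order=(0, 2, 0))  # d2/dy2
g2_t = gaussian_filter(volume, sigma=1.5, order=(2, 0, 0))  # d2/dt2
g_xy = gaussian_filter(volume, sigma=1.5, order=(0, 1, 1))  # mixed d2/dxdy
```

Steering a second derivative to an arbitrary three-dimensional orientation requires six such basis volumes (three pure and three mixed second derivatives); the sketch shows four of them.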

In light of this previous work, local oriented energy representations have been applied to spatio-temporal grouping [12], efficient action spotting [11] and visual tracking [6]. Derpanis and Wildes [12] adopted the oriented energy representation for grouping raw image data into a set of coherent spatio-temporal regions. This representation describes the presence of particular oriented spatio-temporal structures in a distributed manner, capturing multiple oriented structures at a given location. They further designed a descriptor based on the oriented energy measurements for action spotting [11]. Slightly different from [12], in [11] the local energies are calculated from a third-order Gaussian derivative rather than from the quadrature outputs of steerable filters.

In the same spirit, Cannons et al. [6] developed a pixel-wise spatio-temporal oriented energy representation for visual tracking. In contrast to [12], [11], multi-scale Gaussian steerable filters were used. The representation includes appearance and motion information, as well as information about how these descriptors are spatially arranged.

Our work is motivated by the fact that a video sequence with motion can be represented as a single pattern in X–Y–T space, in which a velocity of motion corresponds to a three-dimensional orientation. Motion information can thus be extracted by a system that responds to oriented spatio-temporal energy. In addition, spatio-temporal features reside at different scales and can be extracted by multi-scale analysis. Steerable filters can efficiently perform multi-orientation analysis of videos, while the Laplacian pyramid provides effective multi-scale analysis. By combining the Laplacian pyramid with steerable filters, the STSP can detect non-orthogonal and over-complete features and exhibits the desirable property of shift and rotation invariance. It is a transform that combines multi-scale decomposition with differential measurements, capturing the oriented structures in spatio-temporal volumes.

Inspired by the success of steerable filters in object classification [32] and video analysis [49], we introduce a novel holistic representation based on the spatio-temporal steerable pyramid (STSP) for action recognition. In contrast to previous holistic methods, our STSP-based method can largely overcome the deficiencies of holistic representations and provides an informative, compact representation of human actions.

Note that this paper is an extension of the work in [54]. In the current version, we generalize the STSP by extending it from intensity to gradients and optical flow, and we conduct comprehensive experiments investigating parameter settings on more datasets.

Given a 3D volume, which in our case can be the intensity volume, optical flow or 3D gradients of a video sequence, a spatio-temporal Laplacian pyramid is first constructed. The volume is decomposed into a set of sub-band volumes, which segregates and enhances spatio-temporal features residing at different scales.
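A minimal sketch of such a three-dimensional Laplacian pyramid is given below; the function name, SciPy-based implementation and parameter values are our illustrative assumptions, not the paper's exact construction.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def laplacian_pyramid_3d(volume, levels=3, sigma=1.0):
    """Decompose a (T, Y, X) volume into band-passed sub-volumes; each
    level keeps the detail lost between one scale and the next."""
    pyramid, current = [], volume.astype(np.float32)
    for _ in range(levels):
        low = gaussian_filter(current, sigma)      # low-pass
        down = low[::2, ::2, ::2]                  # downsample by 2 per axis
        up = zoom(down, [c / d for c, d in zip(current.shape, down.shape)],
                  order=1)                         # upsample back
        pyramid.append(current - up)               # band-pass residual
        current = down
    pyramid.append(current)                        # coarsest low-pass level
    return pyramid

bands = laplacian_pyramid_3d(np.random.rand(32, 64, 64))
```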

To efficiently explore oriented patterns in video sequences, a bank of spatio-temporal steerable filters at different scales is then applied to each level of the resulting Laplacian pyramid. These filters are separable and steerable in three dimensions (X–Y–T) and can therefore be computed efficiently.
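Continuing the sketch above and reusing the bands it produces, a separable derivative-of-Gaussian bank could be applied to every sub-volume as follows; the order tuples are illustrative and not the paper's exact basis set.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Six second-derivative basis volumes per pyramid level: three pure and
# three mixed derivatives, enough to steer to any 3D orientation.
orders = [(0, 0, 2), (0, 2, 0), (2, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
responses = [[gaussian_filter(band, sigma=1.5, order=o) for o in orders]
             for band in bands]  # `bands` from the pyramid sketch above
```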

Motivated by the previous work, we employ a representation based on spatio-temporal local energies, which are calculated from the quadrature pairs of filter responses at the voxels of each volume.
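A per-voxel version of the energy computation might look as follows. This is a hedged sketch (a Gabor quadrature pair along one axis combined with Gaussian smoothing along the others), not the paper's exact filters.

```python
import numpy as np
from scipy.ndimage import convolve1d, gaussian_filter

def voxel_energy(volume, axis=0, sigma=2.0, freq=0.25, size=13):
    """Per-voxel oriented energy: quadrature pair along `axis`,
    Gaussian smoothing along the remaining axes."""
    t = np.arange(size) - size // 2
    env = np.exp(-t**2 / (2 * sigma**2))
    even = env * np.cos(2 * np.pi * freq * t)
    odd = env * np.sin(2 * np.pi * freq * t)
    sigmas = [sigma] * volume.ndim
    sigmas[axis] = 0.0                    # do not smooth the oriented axis
    smoothed = gaussian_filter(volume, sigmas)
    e = convolve1d(smoothed, even, axis=axis)
    o = convolve1d(smoothed, odd, axis=axis)
    return e**2 + o**2                    # phase-invariant local energy

energy_t = voxel_energy(np.random.rand(32, 64, 64), axis=0)  # temporal axis
```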

Finally, a feature pooling operation, i.e., max pooling, is performed between adjacent scales of the steerable filters and over local spatio-temporal neighbourhoods, which makes the final representation more robust and invariant to scaling and shifts. In addition, the features become more compact after max pooling. The flowchart of feature extraction is illustrated in Fig. 1.
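As an illustration of the pooling step (window sizes and names are assumptions, not the paper's settings), max pooling over non-overlapping spatio-temporal neighbourhoods, followed by an element-wise max between two adjacent-scale energy volumes, could be implemented as follows.

```python
import numpy as np

def spatiotemporal_max_pool(volume, window=(4, 8, 8)):
    """Max over non-overlapping (T, Y, X) neighbourhoods."""
    t, y, x = (s - s % w for s, w in zip(volume.shape, window))
    v = volume[:t, :y, :x].reshape(t // window[0], window[0],
                                   y // window[1], window[1],
                                   x // window[2], window[2])
    return v.max(axis=(1, 3, 5))

# Pooling between adjacent scales: element-wise max of two energy volumes
# (assumed here to already be resampled to the same grid).
e_fine = spatiotemporal_max_pool(np.random.rand(16, 64, 64))
e_coarse = spatiotemporal_max_pool(np.random.rand(16, 64, 64))
pooled = np.maximum(e_fine, e_coarse)
```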

The contributions of the proposed method can be summarized as follows: (1) a new model based on the spatio-temporal steerable pyramid is proposed for action recognition; (2) local oriented energies are employed for the first time as spatio-temporal features for the holistic representation of human actions; (3) a spatio-temporal max pooling operation is incorporated into the spatio-temporal steerable pyramid model, leading to a more robust and compact representation.

The rest of the paper is organized as follows: Section 2 reviews related work, Section 3 describes the details of our method, Section 4 presents experimental results, and Section 5 concludes the paper.

Related work

Since the pioneering works in [40], [13], [26], local features, i.e., spatio-temporal interest points (STIPs), in conjunction with the bag-of-words (BoW) model, have become popular for the local representation of human actions. Effective local feature detectors and descriptors have been designed and serve as the basis for local methods.

Local representation is less sensitive to partial occlusions and clutter and can avoid some preliminary steps used in holistic methods, such as background subtraction and tracking.

Feature extraction

The theories of multi-scale representation and orientation analysis have been widely researched and used for image and video analysis. We instantiate them in video analysis by combining the Laplacian pyramid and steerable filters, and propose the spatio-temporal steerable pyramid (STSP) framework for the holistic representation of human actions.

In order to fully exploit the information residing in a video sequence with actions occurring in it, we apply spatio-temporal steerable filters to the intensity, 3D gradients and optical flow of the sequence.

Experiments and results

We evaluate the proposed method, i.e., STSP, on the KTH dataset, the UCF Sports dataset and the newly released HMDB51 dataset. Sample frames from the three datasets are illustrated in Fig. 5. A linear support vector machine (SVM) is used for action classification [7].
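For concreteness, here is a minimal sketch of the classification stage, assuming pooled STSP descriptors have already been computed per clip; scikit-learn's LinearSVC stands in for the linear SVM of [7], and the features and labels below are synthetic placeholders.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Placeholder pooled STSP descriptors: one feature vector per video clip.
X_train = np.random.rand(100, 512)
y_train = np.random.randint(0, 6, size=100)   # e.g., six KTH action classes

clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
clf.fit(X_train, y_train)
predictions = clf.predict(np.random.rand(10, 512))
```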

Conclusion

In this paper, we have introduced an efficient holistic representation, named the spatio-temporal steerable pyramid (STSP), for human action recognition. By decomposing a video sequence with a Laplacian pyramid, spatio-temporal salient features of various sizes can be well localized and enhanced. Multi-scale steerable filters efficiently extract features at multiple scales and orientations. The spatio-temporal max pooling operation makes the features more compact while keeping them invariant and robust.

Acknowledgements

The authors acknowledge the support of the University of Sheffield, the China Scholarship Council (CSC), the National Natural Science Foundation of China (Grant No: 61125106), and Shaanxi Key Innovation Team of Science and Technology (Grant No: 2012KCT-04).

References (55)

  • X. Deng et al., LF-EME: local features with elastic manifold embedding for human action recognition, Neurocomputing (2012)
  • K. Derpanis, J. Gryn, Three-dimensional Nth derivative of Gaussian separable steerable filters, in: IEEE Conference on...
  • K. Derpanis, M. Sizintsev, K. Cannons, R. Wildes, Efficient action spotting based on a spacetime oriented structure...
  • K. Derpanis, R. Wildes, Early spatiotemporal grouping with a distributed oriented energy representation, in: IEEE...
  • P. Dollár, V. Rabaud, G. Cottrell, S. Belongie, Behavior recognition via sparse spatio-temporal features, in: 2nd Joint...
  • A.A. Efros, A.C. Berg, G. Mori, J. Malik, Recognizing action at a distance, in: IEEE International Conference on...
  • W. Freeman et al., The design and use of steerable filters, IEEE Trans. Pattern Anal. Mach. Intell. (1991)
  • L. Gorelick et al., Actions as space-time shapes, IEEE Trans. Pattern Anal. Mach. Intell. (2007)
  • G. Hinton et al., A fast learning algorithm for deep belief nets, Neural Comput. (2006)
  • L. Itti et al., A model of saliency-based visual attention for rapid scene analysis, IEEE Trans. Pattern Anal. Mach. Intell. (1998)
  • H. Jhuang, T. Serre, L. Wolf, T. Poggio, A biologically inspired system for action recognition, in: IEEE International...
  • S. Ji, W. Xu, M. Yang, K. Yu, 3D convolutional neural networks for human action recognition, in: International...
  • Y.-G. Jiang et al., Trajectory-based modeling of human actions with motion reference points
  • O. Kliper-Gross et al., Motion interchange patterns for action recognition in unconstrained videos
  • A. Kovashka, K. Grauman, Learning a hierarchy of discriminative space-time neighborhood features for human action...
  • H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, T. Serre, HMDB: a large video database for human motion recognition, in: IEEE International...
  • I. Laptev, T. Lindeberg, Space-time interest points, in: IEEE International Conference on Computer Vision,...