Abstract
We propose a novel approach to modeling the spatio-temporal distribution of local features for action recognition in videos. The approach builds on Lie Algebrized Gaussians (LAG), a feature aggregation method that yields a high-dimensional video signature. In the LAG framework, the local features extracted from a video are used to train a video-specific Gaussian Mixture Model (GMM). The video-specific GMM is then encoded as a vector based on Lie group theory, a step also referred to as GMM vectorization. Because the video-specific GMM gives a soft partition of the feature space, for each cell of that partition (i.e., each Gaussian component) we use a GMM to model the spatio-temporal locations of the local features assigned to that component. The location GMMs are encoded as vectors in the same way as the local feature GMM. We term the resulting vectors spatio-temporal LAG (STLAG). In addition, although LAG and the popular Fisher Vector (FV) are derived from distinct theoretical perspectives, we find that they are closely related. Hence the power and ℓ2 normalization proposed for the FV are also beneficial to LAG. Experimental results show that STLAG is very effective at modeling spatio-temporal layout compared with other techniques such as the spatio-temporal pyramid and feature augmentation. Using state-of-the-art dense trajectory features, our approach achieves state-of-the-art performance on two challenging datasets: Hollywood2 and HMDB51.
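The soft partition and the normalization step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a diagonal-covariance GMM whose parameters are already given, summarizes the locations in each cell with a single weighted Gaussian instead of a full location GMM, and omits the Lie-group vectorization entirely. All function names, and the default exponent `alpha=0.5`, are assumptions made for illustration.

```python
import math


def diag_gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def responsibilities(x, weights, means, variances):
    """Soft assignment of one descriptor x to every GMM component."""
    logs = [
        math.log(w) + diag_gaussian_logpdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]


def location_stats(descriptors, locations, weights, means, variances):
    """Responsibility-weighted mean and variance of (x, y, t) per component.

    Simplification (assumption): one Gaussian per component stands in for
    the per-component location GMM described in the abstract.
    """
    resp = [responsibilities(d, weights, means, variances) for d in descriptors]
    dims = len(locations[0])
    stats = []
    for k in range(len(weights)):
        mass = sum(r[k] for r in resp) + 1e-12  # guard against empty cells
        mu = [
            sum(r[k] * loc[j] for r, loc in zip(resp, locations)) / mass
            for j in range(dims)
        ]
        var = [
            sum(r[k] * (loc[j] - mu[j]) ** 2 for r, loc in zip(resp, locations)) / mass
            for j in range(dims)
        ]
        stats.append((mu, var))
    return stats


def power_l2_normalize(vec, alpha=0.5):
    """Signed power normalization followed by l2 normalization."""
    powered = [math.copysign(abs(v) ** alpha, v) for v in vec]
    norm = math.sqrt(sum(p * p for p in powered))
    return [p / norm for p in powered] if norm > 0 else powered
```

In this sketch, `location_stats` returns one `(mean, variance)` pair per Gaussian component, and `power_l2_normalize` would be applied to the final concatenated signature before classification.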
Notes
Software available at http://lear.inrialpes.fr/~wang/improved_trajectories
Acknowledgements
This work was supported by grants from the National Natural Science Foundation of China (No. U1233119) and the Wuhan Key Science and Technology Project (No. 2014010202010110).
Cite this article
Chen, M., Gong, L., Wang, T. et al. Modeling spatio-temporal layout with Lie Algebrized Gaussians for action recognition. Multimed Tools Appl 75, 10335–10355 (2016). https://doi.org/10.1007/s11042-015-3008-4
DOI: https://doi.org/10.1007/s11042-015-3008-4