Sparse Coding on Local Spatial-Temporal Volumes for Human Action Recognition

Zhu, Yan; Zhao, Xu; Fu, Yun; Liu, Yuncai

doi:10.1007/978-3-642-19309-5_51

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 6493)

Included in the following conference series:

Asian Conference on Computer Vision

Abstract

Many recently proposed approaches to action recognition achieve promising performance by extracting local spatial-temporal features from videos. These approaches commonly use the Bag-of-Words (BoW) model to obtain video-level representations. However, the BoW model coarsely assigns each feature vector to its single closest visual word, which inevitably causes nontrivial quantization errors and limits further improvement in classification rates. To obtain a more accurate and discriminative representation, we propose an approach to action recognition that encodes local 3D spatial-temporal gradient features within the sparse coding framework. Each local spatial-temporal feature is thereby represented as a linear combination of a few “atoms” from a trained dictionary. We also investigate constructing the dictionary under the guidance of transfer learning: we collect a large set of diverse video clips from sports games and movies, and learn from them a set of universal atoms that compose the dictionary, using an online learning strategy. We test our approach on the KTH dataset and the UCF sports dataset. Experimental results demonstrate that our approach outperforms state-of-the-art techniques on the KTH dataset and achieves comparable performance on the UCF sports dataset.
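
The contrast the abstract draws between BoW hard assignment and sparse coding can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not taken from the paper: the three 2-D atoms, the feature vector `x`, the penalty `lam`, and the use of ISTA (iterative soft-thresholding) as the L1 solver. The abstract does not specify the authors' solver, dictionary size, or feature dimension. Plain Python is used to keep the sketch dependency-free.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def reconstruct(atoms, code):
    """Return sum_k code[k] * atoms[k]."""
    dim = len(atoms[0])
    return [sum(c * atom[i] for c, atom in zip(code, atoms)) for i in range(dim)]

def residual_norm(x, atoms, code):
    """Euclidean distance between x and its reconstruction from the code."""
    return math.dist(x, reconstruct(atoms, code))

def hard_assign(x, atoms):
    """BoW-style encoding: one-hot code selecting the nearest atom (visual word)."""
    dists = [math.dist(x, atom) for atom in atoms]
    code = [0.0] * len(atoms)
    code[dists.index(min(dists))] = 1.0
    return code

def soft_threshold(z, t):
    """Shrink z toward zero by t (proximal operator of the L1 norm)."""
    return math.copysign(max(abs(z) - t, 0.0), z)

def sparse_code_ista(x, atoms, lam=0.1, n_iter=200):
    """Approximately minimise 0.5*||x - D a||^2 + lam*||a||_1 with ISTA."""
    # Squared Frobenius norm of D: a safe upper bound on the gradient's
    # Lipschitz constant, so 1/lipschitz is a valid step size.
    lipschitz = sum(dot(atom, atom) for atom in atoms)
    code = [0.0] * len(atoms)
    for _ in range(n_iter):
        recon = reconstruct(atoms, code)
        resid = [ri - xi for ri, xi in zip(recon, x)]
        grad = [dot(atom, resid) for atom in atoms]
        code = [soft_threshold(c - g / lipschitz, lam / lipschitz)
                for c, g in zip(code, grad)]
    return code

# Hypothetical toy dictionary: three unit-norm atoms in 2-D (overcomplete).
atoms = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
x = [2.0, 0.5]  # hypothetical local feature vector

bow_code = hard_assign(x, atoms)
sparse = sparse_code_ista(x, atoms)

bow_err = residual_norm(x, atoms, bow_code)
sparse_err = residual_norm(x, atoms, sparse)
print("BoW code:", bow_code, "quantization error:", round(bow_err, 3))
print("sparse code:", [round(c, 3) for c in sparse], "residual:", round(sparse_err, 3))
```

In this toy setting the one-hot BoW code can only snap `x` to a single atom, whereas the sparse code combines atoms with real-valued weights and leaves a smaller residual. That gap is the quantization error the abstract refers to.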

Author information

Authors and Affiliations

Shanghai Jiao Tong University, Shanghai, 200240, China
Yan Zhu, Xu Zhao & Yuncai Liu
Department of CSE, SUNY, Buffalo, NY, 14260, USA
Yun Fu

Editor information

Editors and Affiliations

Department of Computer Science, Technion, Israel Institute of Technology, 32000, Haifa, Israel
Ron Kimmel
The University of Auckland, 37 Kohimarama Road, Mission Bay, 1071, Auckland, New Zealand
Reinhard Klette
National Institute of Informatics, 1018430, Chiyoda, Tokyo, Japan
Akihiro Sugimoto

About this paper

Cite this paper

Zhu, Y., Zhao, X., Fu, Y., Liu, Y. (2011). Sparse Coding on Local Spatial-Temporal Volumes for Human Action Recognition. In: Kimmel, R., Klette, R., Sugimoto, A. (eds) Computer Vision – ACCV 2010. ACCV 2010. Lecture Notes in Computer Science, vol 6493. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19309-5_51

DOI: https://doi.org/10.1007/978-3-642-19309-5_51
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19308-8
Online ISBN: 978-3-642-19309-5
eBook Packages: Computer Science (R0)
