Sparse Coding on Local Spatial-Temporal Volumes for Human Action Recognition

  • Conference paper
Computer Vision – ACCV 2010 (ACCV 2010)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 6493)

Abstract

Many recently proposed approaches for action recognition achieve promising performance by extracting local spatial-temporal features from videos. The Bag-of-Words (BoW) model is commonly used in these approaches to obtain video-level representations. However, the BoW model coarsely assigns each feature vector to its single closest visual word, inevitably introducing nontrivial quantization errors that limit further improvements in classification rates. To obtain a more accurate and discriminative representation, in this paper we propose an approach for action recognition that encodes local 3D spatial-temporal gradient features within the sparse coding framework. In so doing, each local spatial-temporal feature is represented as a linear combination of a few “atoms” from a trained dictionary. In addition, we investigate constructing the dictionary under the guidance of transfer learning: we collect a large and diverse set of video clips from sport games and movies, from which the universal atoms composing the dictionary are learned by an online learning strategy. We evaluate our approach on the KTH dataset and the UCF sports dataset. Experimental results demonstrate that our approach outperforms state-of-the-art techniques on the KTH dataset and achieves comparable performance on the UCF sports dataset.
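The core idea of the abstract — replacing BoW's hard assignment to one visual word with a sparse linear combination of dictionary atoms — can be sketched as an L1-regularized least-squares (lasso) problem solved by iterative soft-thresholding (ISTA). This is a minimal illustration, not the paper's implementation: the dictionary here is random rather than learned online from HOG3D-style gradient descriptors, and the feature dimension and atom count are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a learned dictionary: 20 unit-norm atoms for 10-D features.
n_features, n_atoms = 10, 20
D = rng.standard_normal((n_features, n_atoms))
D /= np.linalg.norm(D, axis=0)

def sparse_code(x, D, lam=0.3, n_iter=500):
    """Solve min_a 0.5*||x - D a||^2 + lam*||a||_1 via ISTA."""
    L = np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)         # gradient of the quadratic term
        z = a - grad / L                 # gradient step
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return a

x = rng.standard_normal(n_features)      # one local spatial-temporal feature
a = sparse_code(x, D)
print("nonzero atoms:", np.count_nonzero(np.abs(a) > 1e-8), "of", n_atoms)
```

Unlike BoW quantization, which would keep only the single closest atom, the sparse code `a` typically activates a handful of atoms, so the reconstruction `D @ a` approximates the feature much more closely and the resulting video-level representation retains finer distinctions.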




Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhu, Y., Zhao, X., Fu, Y., Liu, Y. (2011). Sparse Coding on Local Spatial-Temporal Volumes for Human Action Recognition. In: Kimmel, R., Klette, R., Sugimoto, A. (eds) Computer Vision – ACCV 2010. ACCV 2010. Lecture Notes in Computer Science, vol 6493. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19309-5_51

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-19309-5_51

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-19308-8

  • Online ISBN: 978-3-642-19309-5

  • eBook Packages: Computer Science (R0)
