
Learning a discriminative mid-level feature for action recognition

  • Research Paper
  • Published in Science China Information Sciences

Abstract

In this paper, we address the problem of recognizing human actions from videos. Most existing approaches employ low-level features (e.g., local and global features) to represent an action video. However, algorithms based on low-level features are not robust to complex environments with cluttered backgrounds, camera movement, and illumination changes. Therefore, we propose a novel random forest learning framework to construct a discriminative and informative mid-level feature from the low-level features of densely sampled 3D cuboids. Each cuboid is classified by the corresponding random forests with a novel fusion scheme, and the cuboid’s posterior probabilities over all categories are normalized to generate a histogram. We then obtain our mid-level feature by concatenating the histograms of all the cuboids. Since a single low-level feature is not enough to capture the variations of human actions, multiple complementary low-level features (i.e., optical flow and histogram of gradient 3D features) are employed to describe the 3D cuboids. Moreover, the temporal context between local cuboids is exploited as another type of low-level feature. These three low-level features (i.e., optical flow, histogram of gradient 3D features, and temporal context) are effectively fused in the proposed learning framework. Finally, the mid-level feature is fed to a random forest classifier for robust action recognition. Experiments on the Weizmann, UCF sports, Ballet, and multi-view IXMAS datasets demonstrate that our mid-level feature learned from multiple low-level features achieves superior performance over state-of-the-art methods.
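The core construction described above — classify each densely sampled 3D cuboid with a random forest, normalize its class-posterior probabilities into a histogram, and concatenate the histograms of all cuboids into one mid-level feature vector — can be sketched as follows. This is a minimal illustration using scikit-learn's `RandomForestClassifier` on synthetic data; the descriptor dimension, cuboid count, and the `mid_level_feature` helper are hypothetical stand-ins, not the paper's actual implementation (which fuses multiple forests over optical flow, HOG3D, and temporal-context features).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_classes = 4   # number of action categories (hypothetical)
n_cuboids = 6   # cuboids densely sampled from one video (hypothetical)
dim = 32        # low-level descriptor dimension per cuboid (hypothetical)

# Training set: low-level descriptors of cuboids with their action labels.
X_train = rng.normal(size=(200, dim))
y_train = rng.integers(0, n_classes, size=200)

# A random forest trained on cuboid descriptors (stand-in for the
# paper's per-feature forests with their fusion scheme).
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)

def mid_level_feature(cuboids, forest):
    """Concatenate per-cuboid class-posterior histograms into one vector."""
    probs = forest.predict_proba(cuboids)             # (n_cuboids, n_classes)
    probs = probs / probs.sum(axis=1, keepdims=True)  # normalize each histogram
    return probs.ravel()                              # concatenation

# Build the mid-level feature for one (synthetic) video.
video_cuboids = rng.normal(size=(n_cuboids, dim))
feat = mid_level_feature(video_cuboids, forest)
print(feat.shape)  # (24,) = n_cuboids * n_classes
```

The resulting vector could then be fed to a second random forest classifier for the final action decision, as the abstract describes.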



Author information

Corresponding author

Correspondence to MingTao Pei.

About this article

Cite this article

Liu, C., Pei, M., Wu, X. et al. Learning a discriminative mid-level feature for action recognition. Sci. China Inf. Sci. 57, 1–13 (2014). https://doi.org/10.1007/s11432-013-4938-y

