
Pose Adaptive Motion Feature Pooling for Human Action Analysis

Published in: International Journal of Computer Vision

Abstract

Ineffective spatial–temporal motion feature pooling has long been a fundamental bottleneck for human action recognition and detection. Previous pooling schemes, such as global, spatial–temporal pyramid, or human- and object-centric pooling, fail to capture discriminative motion patterns because informative movements occur only in specific regions of the human body, which depend on the type of action being performed. Global (holistic) motion feature pooling therefore often yields an action representation with limited discriminative capability. To address this fundamental limitation, we propose an adaptive motion feature pooling scheme that uses human poses as side information. Such poses can be detected, for instance, in assisted living and indoor smart surveillance scenarios. Taking both the video sub-volumes used for pooling and the human pose types as hidden variables, we formulate motion feature pooling as a latent structural learning problem in which the relationship between discriminative pooling sub-volumes and pose types is learned. The resulting pose adaptive motion feature pooling scheme is extensively tested on assisted living and smart surveillance datasets and on general action recognition benchmarks, where it improves both action recognition and detection performance.
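To make the latent formulation concrete, the following is a minimal sketch of the inference step the abstract describes: both the pooling sub-volume and the pose type are treated as hidden variables, and prediction maximizes a linear score over all (action, pose, sub-volume) combinations. The function and variable names (`pool_features`, `predict`, `W`) are illustrative assumptions, not the paper's implementation, which additionally learns the pose/sub-volume relationship during structural training.

```python
import numpy as np

def pool_features(video_features, sub_volume):
    """Average-pool dense motion features inside a spatio-temporal sub-volume.

    video_features: array of shape (T, H, W, D); sub_volume: (x0, x1, y0, y1, t0, t1).
    """
    x0, x1, y0, y1, t0, t1 = sub_volume
    region = video_features[t0:t1, y0:y1, x0:x1, :]
    return region.reshape(-1, region.shape[-1]).mean(axis=0)

def predict(video_features, sub_volumes, W):
    """Maximize a linear score over hidden pose types and pooling sub-volumes.

    W maps each action label to a list of per-pose weight vectors; the pose
    and the sub-volume achieving the maximum are the inferred hidden variables.
    """
    best_action, best_score = None, -np.inf
    for action, pose_weights in W.items():
        score = max(w @ pool_features(video_features, z)
                    for w in pose_weights for z in sub_volumes)
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```

In the paper's latent structural learning setting, the same maximization over hidden variables is used inside training to select the most discriminative pooling regions for each pose type.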


Notes

  1. We also tested offline other spatial partition schemes, including: (1) a vertical-4region-overlap scheme; (2) a vertical-3region-nonoverlap scheme; and (3) a vertical-3region-overlap/horizontal-2region-nonoverlap scheme (simply adding a horizontal cut in the middle of the partition scheme used in this work). The results show that the overlapping partition schemes outperform their non-overlapping versions, and that the six-region scheme (i.e., vertical-3region-overlap/horizontal-2region-nonoverlap) only slightly outperforms the three-region scheme, at much higher computational cost. Therefore, in this work we use the vertical-3region-overlap partition scheme, which also naturally corresponds to the head–upper torso, torso, and lower torso–leg regions.
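The vertical-3region-overlap partition described in this note can be sketched as follows. This is an illustrative assumption, not code from the paper: the 50% overlap ratio and the function name are hypothetical, chosen only to show how three equal-height overlapping bands (roughly head–upper torso, torso, and lower torso–legs) can tile a person bounding box.

```python
def vertical_3region_overlap(top, bottom, overlap=0.5):
    """Return three overlapping (y_start, y_end) bands spanning [top, bottom].

    Band height h is chosen so that three bands with the given overlap ratio
    exactly tile the box: 3*h - 2*overlap*h = height  =>  h = height / (3 - 2*overlap).
    """
    height = bottom - top
    h = height / (3 - 2 * overlap)
    step = h * (1 - overlap)  # vertical stride between consecutive bands
    return [(top + i * step, top + i * step + h) for i in range(3)]
```

For a 200-pixel-tall box with 50% overlap this yields bands of height 100 at offsets 0, 50, and 100, so each adjacent pair shares half its extent.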

  2. We have tested offline our poselet key-framing implementation on the UT-Interaction dataset; our recognition accuracy on half videos is 71.5%, which is comparable with the 73.3% reported in the original work (Raptis and Sigal 2013). Note that the manual annotations of Raptis and Sigal (2013) are not available.

References

  • Andrews, S., Tsochantaridis, I., & Hofmann, T. (2003). Support vector machines for multiple instance learning. In: Advances in neural information processing systems (pp. 561–568). MIT Press.

  • Bourdev, L., & Malik, J. (2009). Poselets: Body part detectors trained using 3D human pose annotations. In: International conference on computer vision. URL http://www.eecs.berkeley.edu/~lbourdev/poselets

  • Chang, C. C., & Lin, C. J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent System and Technology, 2(27), 1–27.


  • Chen, Q., Song, Z., Hua, Y., Huang, Z., & Yan, S. (2011). Hierarchical matching with side information for image classification. In: International conference on computer vision and pattern recognition.

  • Choi, J., Jeon, W.J., & Lee, S.C. (2008). Spatio-temporal pyramid matching for sports videos. In: ACM multimedia information retrieval.

  • Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In: International conference on computer vision and pattern recognition (pp. 886–893).

  • Dollar, P., Rabaud, V., Cottrell, G., & Belongie, S. (2005). Behavior recognition via sparse spatio-temporal features. In: VS-PETS.

  • Duchenne, O., Laptev, I., Sivic, J., Bach, F., & Ponce, J. (2009). Automatic annotation of human actions in video. In: International conference on computer vision (pp. 1491–1498).

  • Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.


  • Girshick, R.B., Felzenszwalb, P.F., & McAllester, D. (2012). Discriminatively trained deformable part models, release 5. http://people.cs.uchicago.edu/~rbg/latent-release5/

  • Yang, J., Yu, K., Gong, Y., & Huang, T. (2009). Linear spatial pyramid matching using sparse coding for image classification. In: International conference on computer vision and pattern recognition.

  • Jiang, Y., Yuan, J., & Yu, G. (2012). Randomized spatial partition for scene recognition. In: European conference on computer vision.

  • Kanan, C., & Cottrell, G. (2010). Robust classification of objects, faces, and flowers using natural image statistics. In: International conference on computer vision and pattern recognition.

  • Klaser, A., Marszalek, M., & Schmid, C. (2008). A spatio-temporal descriptor based on 3D gradients. In: British machine vision conference.

  • Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., & Serre, T. (2011). HMDB: A large video database for human motion recognition. In: International conference on computer vision.

  • Laptev, I., & Lindeberg, T. (2003). Space-time interest points. In: International conference on computer vision.

  • Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: International conference on computer vision and pattern recognition.

  • Lv, F., & Nevatia, R. (2007). Single view human action recognition using key pose matching and Viterbi path searching. In: International conference on computer vision and pattern recognition.

  • Marszalek, M., Laptev, I., & Schmid, C. (2009). Actions in context. In: International conference on computer vision and pattern recognition (pp. 2929–2936).

  • Ni, B., Wang, G., & Moulin, P. (2011). RGBD-HuDaAct: A color-depth video database for human daily activity recognition. In: ICCV workshops (pp. 1147–1153).

  • Niebles, J. C., Chen, C. W., & Fei-Fei, L. (2010). Modeling temporal structure of decomposable motion segments for activity classification. In: European conference on computer vision (pp. 392–405).

  • Perronnin, F., Sánchez, J., & Mensink, T. (2010). Improving the fisher kernel for large-scale image classification. In: European conference on computer vision (pp. 143–156).

  • Raptis, M., & Sigal, L. (2013). Poselet key-framing: A model for human activity recognition. In: International conference on computer vision and pattern recognition (pp. 2650–2657).

  • Raptis, M., Kokkinos, I., & Soatto, S. (2012). Discovering discriminative action parts from mid-level video representations. In: International conference on computer vision and pattern recognition.

  • Russakovsky, O., Lin, Y., Yu, K., & Fei-Fei, L. (2012). Object-centric spatial pooling for image classification. In: European conference on computer vision.

  • Ryoo, M.S., & Aggarwal, J. (2009). Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities. In: International conference on computer vision (pp. 1593–1600).

  • Satkin, S., & Hebert, M. (2010). Modeling the temporal extent of actions. In: European conference on computer vision (pp. 536–548).

  • Schuldt, C., Laptev, I., & Caputo, B. (2004). Recognizing human actions: A local SVM approach. In: International conference on pattern recognition.

  • Shi, Q., Wang, L., Cheng, L., & Smola, A. (2011). Discriminative human action segmentation and recognition using semi-Markov models. International Journal of Computer Vision, 93(1), 22–32.


  • Shimada, A., Kondo, K., Deguchi, D., Morin, G., & Stern, H. (2013). Kitchen scene context based gesture recognition: A contest in ICPR 2012. In: Advances in depth image analysis and applications (Vol. 7854, pp. 168–185). URL http://www.murase.m.is.nagoya-u.ac.jp/KSCGR/index.html

  • Tang, K., Fei-Fei, L., & Koller, D. (2012). Learning latent temporal structure for complex event detection. In: International conference on computer vision and pattern recognition.

  • Vahdat, A., Gao, B., Ranjbar, M., & Mori, G. (2011). A discriminative key pose sequence model for recognizing human interactions. In: ICCV workshop (pp. 1729–1736).

  • Wang, G., & Forsyth, D. (2009). Joint learning of visual attributes, object classes and visual saliency. In: International conference on computer vision.

  • Wang, H., & Schmid, C. (2013). Action recognition with improved trajectories. In: International conference on computer vision.

  • Wang, H., Kläser, A., Schmid, C., & Cheng-Lin, L. (2011). Action recognition by dense trajectories. In: International conference on computer vision and pattern recognition (pp. 3169–3176).

  • Wang, H., Kläser, A., Schmid, C., & Liu, C. L. (2013). Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision, 103(1), 60–79.


  • Wang, J., Liu, Z., Wu, Y., & Yuan, J. (2012). Mining actionlet ensemble for action recognition with depth cameras. In: International conference on computer vision and pattern recognition (pp. 1290–1297).

  • Wang, Y., & Mori, G. (2011). Hidden part models for human action recognition: Probabilistic versus max margin. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(7), 1310–1323.


  • Wolf, C., Mille, J., Lombardi, L., Celiktutan, O., Jiu, M., Baccouche, M., Dellandrea, E., Bichot, C., Garcia, C., & Sankur, B. (2012). The LIRIS human activities dataset and the ICPR 2012 human activities recognition and localization competition. Technical report RR-LIRIS-2012-004, LIRIS laboratory. URL http://liris.cnrs.fr/harl2012/evaluation.html

  • Yakhnenko, O., & Verbeek, J. (2011). Region-based image classification with a latent SVM model. Technical report, INRIA.

  • Yamato, J., Ohya, J., & Ishii, K. (1992). Recognizing human action in time-sequential images using hidden Markov model. In: International conference on computer vision and pattern recognition (pp. 379–385).

  • Yuan, J., Liu, Z., & Wu, Y. (2011). Discriminative video pattern search for efficient action detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(9), 1728–1743.



Author information


Corresponding author

Correspondence to Bingbing Ni.

Additional information

Communicated by M. Hebert.


Cite this article

Ni, B., Moulin, P. & Yan, S. Pose Adaptive Motion Feature Pooling for Human Action Analysis. Int J Comput Vis 111, 229–248 (2015). https://doi.org/10.1007/s11263-014-0742-4

