
Action Recognition from a Single Web Image Based on an Ensemble of Pose Experts

  • Conference paper

Computer Vision – ACCV 2014 (ACCV 2014)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 9003)

Abstract

In this paper, we present a new method that estimates the pose of a human body and identifies its action from a single static image. This is a challenging task due to the high degrees of freedom of body poses and the lack of any motion cues. Specifically, we build a pool of pose experts, each of which individually models a particular type of articulation for a group of human bodies with similar poses or semantics (actions). We investigate two ways to construct these pose experts and show that this method improves pose estimation performance under difficult conditions. Furthermore, in contrast to the previous practice of combining the outputs of the pose experts for action recognition with methods such as majority voting, we propose a flexible strategy that adaptively integrates them in a discriminative framework, allowing each pose expert to adjust its role in action prediction according to its specificity for different action types. In particular, the spatial relationships among the part locations estimated by each expert are encoded in a graph structure, capturing both the non-local and local spatial correlations of the body shape. Each graph is then treated as a separate group, on which an overall group-sparse constraint is imposed when training the prediction model, with extra weight added according to the confidence of the corresponding expert. Our experiments on a challenging web data set yield state-of-the-art results and show that our method effectively improves the tolerance of our system to imperfect pose estimation.
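The weighted group-sparse integration described in the abstract can be sketched in a simplified form. Everything below is an illustrative assumption rather than the paper's implementation: we use a least-squares loss, treat each expert's feature block as one group, scale each group's penalty by a confidence-derived weight, and solve with proximal gradient descent; the function names and data shapes are invented for the example.

```python
import numpy as np

def group_prox(w, groups, penalties, step):
    # Proximal operator of the weighted group-lasso penalty: each expert's
    # coefficient block is shrunk toward zero as a unit, so a whole expert
    # can be switched off rather than individual features.
    w = w.copy()
    for g, pen in zip(groups, penalties):
        norm = np.linalg.norm(w[g])
        thresh = step * pen
        if norm <= thresh:
            w[g] = 0.0
        else:
            w[g] *= 1.0 - thresh / norm
    return w

def fit_group_sparse(X, y, groups, expert_weights, lam=0.1, lr=0.01, iters=500):
    # Proximal gradient descent on a least-squares loss plus the weighted
    # group-lasso penalty; a larger weight penalizes that expert's feature
    # block more heavily (i.e. encodes lower confidence in the expert).
    n, d = X.shape
    w = np.zeros(d)
    penalties = lam * np.asarray(expert_weights)
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / n
        w = group_prox(w - lr * grad, groups, penalties, lr)
    return w

# Toy data: 3 "experts" contribute 4 features each; only expert 1 is informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))
y = X[:, 4:8] @ np.array([1.0, -1.0, 0.5, 0.25])
groups = [slice(0, 4), slice(4, 8), slice(8, 12)]
expert_weights = [1.0, 0.5, 1.0]  # lower weight = more trusted expert
w = fit_group_sparse(X, y, groups, expert_weights)
```

On this toy problem the coefficient blocks of the two uninformative experts are driven to (near) zero as whole groups, while the trusted expert's block survives; this group-level on/off behaviour is what distinguishes the group-sparse constraint from a plain L1 penalty on individual features.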



Acknowledgement

The authors thank the anonymous reviewers for their helpful comments and suggestions. This work was supported by the National Science Foundation of China (grants 61073112, 61035003, 61373060), the Jiangsu Science Foundation (BK2012793), the Qing Lan Project, and the Research Fund for the Doctoral Program (RFDP) (20123218110033).

Author information

Corresponding author

Correspondence to Xiaoyang Tan.


Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Zhang, P., Tan, X., Jin, X. (2015). Action Recognition from a Single Web Image Based on an Ensemble of Pose Experts. In: Cremers, D., Reid, I., Saito, H., Yang, M.-H. (eds) Computer Vision – ACCV 2014. ACCV 2014. Lecture Notes in Computer Science, vol 9003. Springer, Cham. https://doi.org/10.1007/978-3-319-16865-4_31

  • DOI: https://doi.org/10.1007/978-3-319-16865-4_31

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-16864-7

  • Online ISBN: 978-3-319-16865-4

  • eBook Packages: Computer Science, Computer Science (R0)
