Abstract
In this paper, we present a new method that estimates the pose of a human body and identifies its action from a single static image. This is a challenging task due to the high degrees of freedom of body poses and the lack of any motion cues. Specifically, we build a pool of pose experts, each of which individually models a particular type of articulation for a group of human bodies with similar poses or semantics (actions). We investigate two ways to construct these pose experts and show that this method improves pose estimation performance under difficult conditions. Furthermore, in contrast to the conventional practice of combining the output of each pose expert for action recognition with methods such as majority voting, we propose a flexible strategy that adaptively integrates them in a discriminative framework, allowing each pose expert to adjust its role in action prediction according to its specificity when facing different action types. In particular, the spatial relationships between the part locations estimated by each expert are encoded in a graph structure, capturing both the non-local and local spatial correlations of the body shape. Each graph is then treated as a separate group, on which an overall group sparse constraint is imposed to train the prediction model, with extra weight added according to the confidence of the corresponding expert. Experiments on a challenging web data set show that our method achieves state-of-the-art results and effectively improves the tolerance of the system to imperfect pose estimation.
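The group sparse constraint described above can be illustrated with its proximal operator: each expert's coefficients form one group, and the whole group is shrunk toward zero as a unit, with a per-group weight derived from the expert's confidence. The sketch below is a minimal, hypothetical illustration of such a weighted group soft-thresholding step (the variable names, the confidence-to-weight mapping, and the toy data are assumptions, not the paper's actual formulation or code):

```python
import numpy as np

def group_soft_threshold(w, groups, lam, group_weights):
    """Proximal operator of the weighted group-lasso penalty
    lam * sum_g group_weights[g] * ||w[groups[g]]||_2.
    Each group (one pose expert's coefficient block) is shrunk as a
    unit, so an unreliable expert can be switched off entirely."""
    w = w.copy()
    for g, idx in enumerate(groups):
        block = w[idx]
        norm = np.linalg.norm(block)
        # Block-wise shrinkage: scale the whole block, zeroing it
        # when its norm falls below the weighted threshold.
        scale = max(0.0, 1.0 - lam * group_weights[g] / norm) if norm > 0 else 0.0
        w[idx] = scale * block
    return w

# Toy example: 3 hypothetical experts with 4 coefficients each; the
# middle expert has low confidence, hence a large penalty weight.
w = np.array([1.0, -2.0, 0.5, 0.3,
              0.2, -0.1, 0.3, 0.1,
              1.5, 0.8, -0.6, 0.2])
groups = [np.arange(0, 4), np.arange(4, 8), np.arange(8, 12)]
confidence = np.array([0.9, 0.1, 0.8])   # assumed expert confidences
group_weights = 1.0 / confidence          # less confident -> stronger shrinkage
w_new = group_soft_threshold(w, groups, lam=0.5, group_weights=group_weights)
```

In this toy run, the low-confidence expert's block is driven exactly to zero while the two confident experts are only mildly shrunk, which is the behavior the abstract's confidence-weighted group sparsity is designed to produce.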
Acknowledgement
The authors would like to thank the anonymous reviewers for their helpful comments and suggestions. This work was supported by the National Science Foundation of China (61073112, 61035003, 61373060), the Jiangsu Science Foundation (BK2012793), the Qing Lan Project, and the Research Fund for the Doctoral Program (RFDP) (20123218110033).
Copyright information
© 2015 Springer International Publishing Switzerland
Cite this paper
Zhang, P., Tan, X., Jin, X. (2015). Action Recognition from a Single Web Image Based on an Ensemble of Pose Experts. In: Cremers, D., Reid, I., Saito, H., Yang, MH. (eds) Computer Vision – ACCV 2014. ACCV 2014. Lecture Notes in Computer Science(), vol 9003. Springer, Cham. https://doi.org/10.1007/978-3-319-16865-4_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-16864-7
Online ISBN: 978-3-319-16865-4