Abstract
In this paper, we present a new method that estimates the pose of a human body and identifies its action from a single static image. This is a challenging task due to the high degrees of freedom of body poses and the lack of any motion cues. Specifically, we build a pool of pose experts, each of which individually models a particular type of articulation for a group of human bodies with similar poses or semantics (actions). We investigate two ways to construct these pose experts and show that this method improves pose estimation performance under difficult conditions. Furthermore, in contrast to the conventional practice of combining the output of each pose expert for action recognition with methods such as majority voting, we propose a flexible strategy that adaptively integrates them in a discriminative framework, allowing each pose expert to adjust its role in action prediction according to its specificity when facing different action types. In particular, the spatial relationships between the part locations estimated by each expert are encoded in a graph structure, capturing both the non-local and local spatial correlations of the body shape. Each graph is then treated as a separate group, on which an overall group sparse constraint is imposed to train the prediction model, with extra weight added according to the confidence of the corresponding expert. Experiments on a challenging web data set show that our method achieves state-of-the-art results and effectively improves the tolerance of the system to imperfect pose estimation.
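The group sparse constraint described above can be illustrated with its proximal operator: each expert's coefficients form one group, and the whole group is shrunk toward zero as a unit, with a per-group weight derived from the expert's confidence. The sketch below is a minimal, hypothetical illustration of such a weighted group soft-thresholding step (the variable names, the confidence-to-weight mapping, and the toy data are assumptions, not the paper's actual formulation or code):

```python
import numpy as np

def group_soft_threshold(w, groups, lam, group_weights):
    """Proximal operator of the weighted group-lasso penalty
    lam * sum_g group_weights[g] * ||w[groups[g]]||_2.
    Each group (one pose expert's coefficient block) is shrunk as a
    unit, so an unreliable expert can be switched off entirely."""
    w = w.copy()
    for g, idx in enumerate(groups):
        block = w[idx]
        norm = np.linalg.norm(block)
        # Block-wise shrinkage: scale the whole block, zeroing it
        # when its norm falls below the weighted threshold.
        scale = max(0.0, 1.0 - lam * group_weights[g] / norm) if norm > 0 else 0.0
        w[idx] = scale * block
    return w

# Toy example: 3 hypothetical experts with 4 coefficients each; the
# middle expert has low confidence, hence a large penalty weight.
w = np.array([1.0, -2.0, 0.5, 0.3,
              0.2, -0.1, 0.3, 0.1,
              1.5, 0.8, -0.6, 0.2])
groups = [np.arange(0, 4), np.arange(4, 8), np.arange(8, 12)]
confidence = np.array([0.9, 0.1, 0.8])   # assumed expert confidences
group_weights = 1.0 / confidence          # less confident -> stronger shrinkage
w_new = group_soft_threshold(w, groups, lam=0.5, group_weights=group_weights)
```

In this toy run, the low-confidence expert's block is driven exactly to zero while the two confident experts are only mildly shrunk, which is the behavior the abstract's confidence-weighted group sparsity is designed to produce.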
Acknowledgement
The authors would like to thank the anonymous reviewers for their helpful comments and suggestions. This work was supported by the National Science Foundation of China (61073112, 61035003, 61373060), the Jiangsu Science Foundation (BK2012793), the Qing Lan Project, and the Research Fund for the Doctoral Program (RFDP) (20123218110033).
Copyright information
© 2015 Springer International Publishing Switzerland
Cite this paper
Zhang, P., Tan, X., Jin, X. (2015). Action Recognition from a Single Web Image Based on an Ensemble of Pose Experts. In: Cremers, D., Reid, I., Saito, H., Yang, MH. (eds) Computer Vision – ACCV 2014. ACCV 2014. Lecture Notes in Computer Science(), vol 9003. Springer, Cham. https://doi.org/10.1007/978-3-319-16865-4_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-16864-7
Online ISBN: 978-3-319-16865-4