Abstract
Action detection is of great importance in understanding human motion from video. Compared with action recognition, it not only recognizes the action type, but also localizes its spatiotemporal extent. This paper presents a relational model for action detection, which first decomposes human action into temporal “key poses” and then further into spatial “action parts”. Specifically, we start by clustering cuboids around each human joint into dynamic-poselets using a new descriptor. The cuboids from the same cluster share consistent geometric and dynamic structure, and each cluster acts as a mixture of body parts. We then propose a sequential skeleton model to capture the relations among dynamic-poselets. This model unifies the tasks of learning the composites of mixture dynamic-poselets, the spatiotemporal structures of action parts, and the local model for each action part in a single framework. Our model not only localizes the action in a video stream, but also enables detailed pose estimation of the actor. We formulate the model learning problem in a structured SVM framework and speed up model inference by dynamic programming. We conduct experiments on three challenging action detection datasets: the MSR-II dataset, the UCF Sports dataset, and the JHMDB dataset. The results show that our method achieves superior performance to the state-of-the-art methods on these datasets.
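The abstract does not spell out the inference procedure, so the following is only a minimal sketch of the general idea of dynamic-programming inference on a chain-structured model, not the paper's actual model. It assumes a simplified setting in which each temporal key pose picks one of a few candidate states, and the names `chain_max_sum`, `unary`, and `pairwise` are hypothetical, introduced purely for illustration.

```python
from typing import List, Sequence, Tuple


def chain_max_sum(
    unary: Sequence[Sequence[float]],
    pairwise: Sequence[Sequence[Sequence[float]]],
) -> Tuple[float, List[int]]:
    """Best-scoring state sequence for a chain model, by dynamic programming.

    unary[t][s]       : hypothetical local score of state s at key pose t
    pairwise[t][r][s] : hypothetical relation score between state r at
                        key pose t and state s at key pose t + 1
    """
    best = list(unary[0])       # best[s]: top score of any path ending in s
    back: List[List[int]] = []  # back[t - 1][s]: best predecessor of s at step t

    for t in range(1, len(unary)):
        new_best, pointers = [], []
        for s, local in enumerate(unary[t]):
            # Pick the predecessor r maximizing accumulated + relation score.
            r_star = max(
                range(len(best)),
                key=lambda r: best[r] + pairwise[t - 1][r][s],
            )
            new_best.append(best[r_star] + pairwise[t - 1][r_star][s] + local)
            pointers.append(r_star)
        best = new_best
        back.append(pointers)

    # Trace back from the best final state to recover the full sequence.
    state = max(range(len(best)), key=lambda s: best[s])
    score = best[state]
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return score, path


# Toy example: three key poses, two candidate states each (made-up scores).
unary = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
pairwise = [
    [[0.0, -1.0], [0.0, 0.0]],
    [[0.0, 0.0], [3.0, 0.0]],
]
score, path = chain_max_sum(unary, pairwise)
```

With T key poses and K candidate states per pose, this runs in O(T·K²) time instead of enumerating all K^T sequences, which is the kind of saving that makes dynamic programming attractive for models with chain or tree structure.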
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Wang, L., Qiao, Y., Tang, X. (2014). Video Action Detection with Relational Dynamic-Poselets. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds) Computer Vision – ECCV 2014. ECCV 2014. Lecture Notes in Computer Science, vol 8693. Springer, Cham. https://doi.org/10.1007/978-3-319-10602-1_37
DOI: https://doi.org/10.1007/978-3-319-10602-1_37
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10601-4
Online ISBN: 978-3-319-10602-1
eBook Packages: Computer Science, Computer Science (R0)