
A Hierarchical Video Description for Complex Activity Understanding

International Journal of Computer Vision

Abstract

This paper addresses the challenging problem of understanding complex human activities in long videos. Towards this goal, we propose a hierarchical description of an activity video that captures which activity is performed, what atomic actions compose it, and when those atomic actions occur in the video. In our work, each complex activity is characterized as a composition of simple motion units (called atomic actions), and different atomic actions are explained by different video segments. We develop a latent discriminative structural model that detects the complex activity and its atomic actions while simultaneously learning the temporal structure of the atomic actions. A segment-annotation mapping matrix is introduced to relate video segments to their associated atomic actions, allowing different video segments to explain different atomic actions. This mapping matrix is treated as latent information in the model, since its ground truth is available neither during training nor during testing. Moreover, we present a semi-supervised learning method that automatically predicts the atomic-action labels of unlabeled training videos when labeled training data is limited, greatly alleviating the laborious and time-consuming annotation of atomic actions in the training data. Experiments on three activity datasets demonstrate that our method achieves promising activity recognition results and yields rich, hierarchical descriptions of activity videos.
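To make the latent segment-annotation mapping concrete, the following is a minimal sketch in Python/NumPy, with all names (score_with_latent_mapping, segment_feats, action_weights, H) hypothetical rather than taken from the paper. It infers the mapping matrix as a latent variable by assigning each video segment to the atomic-action template it matches best; the paper's full model additionally learns the temporal structure of atomic actions, which this independent-argmax simplification omits.

```python
import numpy as np

def score_with_latent_mapping(segment_feats, action_weights):
    """Score a video under a latent segment-to-atomic-action mapping.

    segment_feats  : (S, D) array, one feature vector per video segment.
    action_weights : (A, D) array, one linear template per atomic action.

    Returns the best total score and the latent mapping matrix H, where
    H[s, a] = 1 iff segment s is assigned to (i.e. "explains") action a.
    """
    S = segment_feats.shape[0]
    # Compatibility score of every (segment, atomic action) pair.
    compat = segment_feats @ action_weights.T            # shape (S, A)
    # Latent inference: each segment picks its best-matching atomic action
    # (the paper's model would also enforce temporal structure here).
    best = compat.argmax(axis=1)                         # shape (S,)
    H = np.zeros((S, action_weights.shape[0]), dtype=int)
    H[np.arange(S), best] = 1
    return compat[np.arange(S), best].sum(), H

# Toy usage: 4 segments, 2 atomic-action templates, 3-D features.
rng = np.random.default_rng(0)
score, H = score_with_latent_mapping(rng.normal(size=(4, 3)),
                                     rng.normal(size=(2, 3)))
```

In a latent structural SVM setting of this kind, the maximization over H would be carried out inside both training and inference, since the ground-truth mapping is never observed.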




Acknowledgments

This work was supported in part by the Natural Science Foundation of China (NSFC) under Grant Nos. 61203274, 61375044 and 61472038.

Author information

Corresponding author

Correspondence to Xinxiao Wu.

Additional information

Communicated by Junsong Yuan, Wanqing Li, Zhengyou Zhang, David Fleet, and Jamie Shotton.


Cite this article

Liu, C., Wu, X. & Jia, Y. A Hierarchical Video Description for Complex Activity Understanding. Int J Comput Vis 118, 240–255 (2016). https://doi.org/10.1007/s11263-016-0897-2
