Abstract
This paper addresses the challenging problem of understanding complex human activities in long videos. Toward this goal, we propose a hierarchical description of an activity video that captures the "which" of the activity, the "what" of its atomic actions, and the "when" of those atomic actions in the video. In our work, each complex activity is characterized as a composition of simple motion units, called atomic actions, and different atomic actions are explained by different video segments. We develop a latent discriminative structural model that detects the complex activity and its atomic actions while simultaneously learning the temporal structure of the atomic actions. A segment-annotation mapping matrix is introduced to relate video segments to their associated atomic actions, allowing different video segments to explain different atomic actions. This mapping matrix is treated as latent information in the model, since its ground truth is unavailable during both training and testing. Moreover, we present a semi-supervised learning method that automatically predicts the atomic-action labels of unlabeled training videos when labeled training data is limited, greatly alleviating the laborious and time-consuming annotation of atomic actions in training data. Experiments on three activity datasets demonstrate that our method achieves promising activity recognition results and yields rich, hierarchical descriptions of activity videos.
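The core idea of the abstract — scoring an activity by inferring a latent mapping from video segments to atomic actions — can be illustrated with a minimal sketch. This is not the authors' model: the linear classifiers, dimensions, and random features below are all hypothetical, and the latent mapping is inferred here by a simple per-segment argmax rather than the paper's structured inference.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all sizes hypothetical): 3 activity classes, each composed
# of 4 atomic actions, and a video split into 6 segments of 10-D features.
n_activities, n_atomic, n_segments, dim = 3, 4, 6, 10

# Linear atomic-action templates: w[y][a] scores how well a segment
# matches atomic action a of activity class y.
w = rng.normal(size=(n_activities, n_atomic, dim))
segments = rng.normal(size=(n_segments, dim))  # per-segment features

def score_activity(y, segments):
    """Score activity y by inferring a latent segment-annotation mapping:
    each segment is assigned to its best-scoring atomic action, and the
    per-segment scores are summed."""
    s = segments @ w[y].T          # (n_segments, n_atomic) score table
    mapping = s.argmax(axis=1)     # latent mapping: segment -> atomic action
    return s.max(axis=1).sum(), mapping

scores = [score_activity(y, segments)[0] for y in range(n_activities)]
predicted = int(np.argmax(scores))          # the "which": detected activity
_, mapping = score_activity(predicted, segments)
print(predicted, mapping)  # mapping gives the "what"/"when" per segment
```

The mapping array plays the role of the segment-annotation mapping matrix in binary form: entry `mapping[i] == a` corresponds to a 1 in row `i`, column `a`. The paper's model additionally learns temporal structure over the atomic actions, which this per-segment argmax ignores.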
Acknowledgments
This work was supported in part by the Natural Science Foundation of China (NSFC) under Grant Nos. 61203274, 61375044 and 61472038.
Communicated by Junsong Yuan, Wanqing Li, Zhengyou Zhang, David Fleet, and Jamie Shotton.
Cite this article
Liu, C., Wu, X. & Jia, Y. A Hierarchical Video Description for Complex Activity Understanding. Int J Comput Vis 118, 240–255 (2016). https://doi.org/10.1007/s11263-016-0897-2