
Space-Time Tree Ensemble for Action Recognition and Localization

Published in: International Journal of Computer Vision

Abstract

Human actions are inherently structured patterns of body movements. We explore ensembles of hierarchical spatio-temporal trees, discovered directly from training data, to model these structures for action recognition and spatial localization. Discovering frequent and discriminative tree structures is challenging due to the exponential search space, particularly if partial matching is allowed. We address this by first building a concise action-word vocabulary via discriminative clustering of hierarchical space-time segments, a two-level video representation that captures both the static and non-static relevant space-time segments of a video. Using this vocabulary, we then apply tree mining, followed by tree clustering and ranking, to select a compact set of discriminative tree patterns. Our experiments show that these tree patterns, alone or in combination with shorter patterns (action words and pairwise patterns), achieve promising performance on three challenging datasets: UCF Sports, HighFive and Hollywood3D. Moreover, we perform cross-dataset validation, using trees learned on HighFive to recognize the same actions in Hollywood3D, and using trees learned on UCF Sports to recognize and localize similar actions in JHMDB. The results demonstrate the potential for cross-dataset generalization of the trees discovered by our approach.
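To make the pipeline concrete, the sketch below illustrates, in a heavily simplified form, the flow the abstract describes: cluster segment descriptors into an action-word vocabulary, express each video as word patterns, and rank patterns by a discriminative score. It is a toy illustration only, not the paper's implementation: KMeans stands in for discriminative clustering, the data and function names are hypothetical, and only parent-child word pairs are scored rather than full tree patterns.

```python
# Toy sketch (not the authors' implementation): build an "action word"
# vocabulary by clustering segment descriptors, then rank simple
# parent-child word patterns by a crude discriminative score.
# All data, names, and parameters here are illustrative assumptions.
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical training data: each video is a set of (parent, child)
# segment descriptor pairs from its hierarchical space-time segments,
# plus a binary action label.
def make_toy_video(label):
    n = rng.integers(5, 15)
    feats = rng.normal(loc=label, size=(n, 2, 8))  # (pair, parent/child, dim)
    return feats, label

videos = [make_toy_video(l) for l in rng.integers(0, 2, size=40)]

# 1) Vocabulary: cluster all segment descriptors into K "action words".
#    (The paper uses discriminative clustering; KMeans is a stand-in.)
all_desc = np.concatenate([f.reshape(-1, 8) for f, _ in videos])
vocab = KMeans(n_clusters=20, n_init=10, random_state=0).fit(all_desc)

# 2) Represent each video as a set of (parent_word, child_word) patterns.
def video_patterns(feats):
    words = vocab.predict(feats.reshape(-1, 8)).reshape(-1, 2)
    return {tuple(w) for w in words}

# 3) Rank patterns by how much more often they occur in the positive
#    class than in the negative class.
pos, neg = Counter(), Counter()
for feats, label in videos:
    for p in video_patterns(feats):
        (pos if label == 1 else neg)[p] += 1

n_pos = sum(1 for _, l in videos if l == 1)
n_neg = len(videos) - n_pos
scores = {p: pos[p] / max(n_pos, 1) - neg[p] / max(n_neg, 1)
          for p in set(pos) | set(neg)}
top = sorted(scores, key=scores.get, reverse=True)[:5]
print("top discriminative parent-child word patterns:", top)
```

In the full method, the mined structures are deeper trees over the action-word vocabulary, and the selected patterns are clustered and ranked before being combined in the final ensemble.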

Notes

  1. To avoid notation clutter, we omit the action class label a for \(\mathcal {T}\), \(\mathbf {w}\), \(\Phi \), \(\phi \) and \(\varphi \).

  2. Note that we use notation \(\mathcal {T}\) to denote discovered tree structures of human actions, and notation \(\mathbf {T}\) to denote image segment trees from video frame hierarchical segmentation.

  3. We did not find prior works that report action classification and localization results for these individual action classes, so no direct comparison is given.

Acknowledgements

This work was supported in part through a Google Faculty Research Award and by US NSF grants 0855065, 0910908, and 1029430.

Author information

Corresponding author

Correspondence to Shugao Ma.

Additional information

Communicated by Ivan Laptev and Cordelia Schmid.

About this article

Cite this article

Ma, S., Zhang, J., Sclaroff, S. et al. Space-Time Tree Ensemble for Action Recognition and Localization. Int J Comput Vis 126, 314–332 (2018). https://doi.org/10.1007/s11263-016-0980-8

Keywords
