Abstract
In this paper we present EXMOVES—learned exemplar-based features for efficient recognition and analysis of actions in videos. The entries in our descriptor are produced by evaluating a set of movement classifiers over spatial-temporal volumes of the input video sequences. Each movement classifier is a simple exemplar-SVM trained on low-level features, i.e., an SVM learned using a single annotated positive space-time volume and a large number of unannotated videos. Our representation offers several advantages. First, since our mid-level features are learned from individual video exemplars, they require a minimal amount of supervision. Second, we show that simple linear classification models trained on our global video descriptor yield action recognition accuracy approaching the state-of-the-art but at orders of magnitude lower cost, since at test time no sliding window is necessary and linear models are efficient to train and test. This enables scalable action recognition, i.e., efficient classification of a large number of actions even in massive video databases. Third, we show the generality of our approach by training our mid-level descriptors from different low-level features and testing them on two distinct video analysis tasks: human activity recognition as well as action similarity labeling. Experiments on large-scale benchmarks demonstrate the accuracy and efficiency of our proposed method on both tasks.
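The abstract's pipeline—train one exemplar-SVM per annotated space-time volume, then form a global video descriptor from the classifiers' responses over subvolumes—can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the feature dimensionality, the use of max pooling over subvolumes, and the positive/negative cost weighting are all assumptions for the sake of the example.

```python
# Hypothetical sketch of the EXMOVES idea: (1) learn an exemplar-SVM from a
# single positive space-time volume and many unannotated (treated as negative)
# volumes; (2) build a global descriptor whose entries are pooled exemplar
# responses over a video's subvolumes. Names and pooling choice are assumed.
import numpy as np
from sklearn.svm import LinearSVC

def train_exemplar_svm(positive_feat, negative_feats, pos_weight=50.0):
    """One annotated positive volume vs. a pool of unannotated volumes.
    pos_weight up-weights the lone positive, as in exemplar-SVM training."""
    X = np.vstack([positive_feat[None, :], negative_feats])
    y = np.array([1] + [0] * len(negative_feats))
    clf = LinearSVC(C=1.0, class_weight={1: pos_weight, 0: 1.0})
    clf.fit(X, y)
    # Return the linear model (w, b) so scoring is a dot product at test time.
    return clf.coef_.ravel(), float(clf.intercept_[0])

def exmoves_descriptor(subvolume_feats, exemplars):
    """Descriptor entry e = max response of exemplar e over all subvolumes."""
    return np.array([np.max(subvolume_feats @ w + b) for w, b in exemplars])
```

Because each exemplar is a linear model, evaluating it over all subvolumes is a single matrix-vector product, which is what makes the descriptor cheap to compute relative to sliding-window detection.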
Acknowledgments
Thanks to Alessandro Bergamo for assistance with the experiments. We are grateful to Tomasz Malisiewicz for clarifications about Exemplar SVM and to Jason Corso for providing the Action Bank exemplars. This research was funded in part by NSF CAREER award IIS-0952943 and NSF award CNS-1205521.
Additional information
Communicated by Ivan Laptev, Josef Sivic, Deva Ramanan.
Cite this article
Tran, D., Torresani, L. EXMOVES: Mid-level Features for Efficient Action Recognition and Video Analysis. Int J Comput Vis 119, 239–253 (2016). https://doi.org/10.1007/s11263-016-0905-6