Exploring the Trade-off Between Accuracy and Observational Latency in Action Recognition

International Journal of Computer Vision

Abstract

An important aspect of designing interactive, action-based interfaces is reliably recognizing actions with minimal latency. High latency causes the system's feedback to lag behind user actions and thus significantly degrades the interactivity of the user experience. This paper presents algorithms for reducing latency when recognizing actions. We use a latency-aware learning formulation to train a logistic regression-based classifier that automatically determines distinctive canonical poses from data and uses these to robustly recognize actions in the presence of ambiguous poses. We introduce a novel, publicly released dataset for our experiments. Comparisons of our method against both a Bag of Words and a Conditional Random Field (CRF) classifier show improved recognition performance for both pre-segmented and online classification tasks. Additionally, we employ GentleBoost to reduce our feature set and further improve our results. We then present experiments that explore the accuracy/latency trade-off over a varying number of actions. Finally, we evaluate our algorithm on two existing datasets.
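
Although the full latency-aware formulation lives in the paper itself, the mechanism behind the accuracy/latency trade-off can be pictured with a minimal sketch: a per-frame classifier accumulates evidence over a sequence and commits to an action label as soon as its confidence crosses a threshold. Everything below (the function name, the softmax pooling of per-frame scores) is an illustrative assumption, not the authors' implementation.

    import numpy as np

    def classify_online(frame_scores, threshold):
        """Fire an action label once the running posterior clears `threshold`.

        frame_scores: (num_frames, num_actions) array of per-frame scores,
            e.g. log-odds from a logistic regression over canonical poses.
        threshold: confidence required before committing; lower values fire
            earlier (less observational latency) but admit more errors.
        Returns (predicted_label, frames_observed).
        """
        num_actions = frame_scores.shape[1]
        cumulative = np.zeros(num_actions)
        probs = np.full(num_actions, 1.0 / num_actions)  # uniform prior
        for t, scores in enumerate(frame_scores, start=1):
            cumulative += scores                   # accumulate evidence
            probs = np.exp(cumulative - cumulative.max())
            probs /= probs.sum()                   # normalize (softmax)
            if probs.max() >= threshold:
                return int(probs.argmax()), t      # commit early
        return int(probs.argmax()), frame_scores.shape[0]

Sweeping the threshold from low to high traces out an accuracy/latency curve of the kind the experiments explore: low thresholds answer after only a few frames but stumble on ambiguous poses, while high thresholds wait for a distinctive canonical pose and answer late.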

Notes

  1. See Sect. 5 for more details on the data gathering process.

  2. The dataset has been made publicly available at http://www.cs.ucf.edu/~smasood/datasets/UCFKinect.zip.

  3. The optimal value of the threshold T was found for each value of γ using the training set (a toy version of this selection is sketched below).
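
As a rough illustration of note 3, the per-γ threshold selection can be pictured as a grid search over the training set. The cost below, classification errors plus a γ-weighted fraction of each sequence observed, is an assumed stand-in for the paper's actual latency-aware objective, and the helper reuses the hypothetical classify_online sketch from the abstract above.

    import numpy as np

    def select_threshold(train_seqs, train_labels, gamma,
                         candidates=np.linspace(0.50, 0.99, 50)):
        """Grid-search the firing threshold T for one value of gamma.

        train_seqs: list of (num_frames, num_actions) score arrays.
        gamma: assumed weight penalizing latency relative to errors.
        """
        best_T, best_cost = float(candidates[0]), float("inf")
        for T in candidates:
            errors, latency = 0, 0.0
            for scores, label in zip(train_seqs, train_labels):
                pred, frames = classify_online(scores, T)
                errors += int(pred != label)
                latency += frames / scores.shape[0]  # fraction observed
            cost = errors + gamma * latency          # assumed trade-off cost
            if cost < best_cost:
                best_T, best_cost = float(T), cost
        return best_T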

Acknowledgements

Marshall F. Tappen, Syed Z. Masood and Chris Ellis were supported by NSF grants IIS-0905387 and IIS-0916868. Joseph J. LaViola Jr. was supported by NSF CAREER award IIS-0845921 and NSF awards IIS-0856045 and CCF-1012056.

Author information

Correspondence to Syed Zain Masood.

S.Z. Masood and C. Ellis contributed equally to this paper.

Cite this article

Ellis, C., Masood, S.Z., Tappen, M.F. et al. Exploring the Trade-off Between Accuracy and Observational Latency in Action Recognition. Int J Comput Vis 101, 420–436 (2013). https://doi.org/10.1007/s11263-012-0550-7