Exploring the Trade-off Between Accuracy and Observational Latency in Action Recognition

International Journal of Computer Vision

Abstract

An important aspect of designing interactive, action-based interfaces is reliably recognizing actions with minimal latency. High latency causes the system's feedback to lag behind user actions and thus significantly degrades the interactivity of the user experience. This paper presents algorithms for reducing latency when recognizing actions. We use a latency-aware learning formulation to train a logistic regression-based classifier that automatically determines distinctive canonical poses from data and uses these to robustly recognize actions in the presence of ambiguous poses. We introduce a novel, publicly released dataset for our experiments. Comparisons of our method against both a Bag of Words and a Conditional Random Field (CRF) classifier show improved recognition performance for both pre-segmented and online classification tasks. Additionally, we employ GentleBoost to reduce our feature set and further improve our results. We then present experiments that explore the accuracy/latency trade-off over a varying number of actions. Finally, we evaluate our algorithm on two existing datasets.
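
Although the full latency-aware formulation lives in the paper itself, the mechanism behind the accuracy/latency trade-off can be pictured with a minimal sketch: a per-frame classifier accumulates evidence over a sequence and commits to an action label as soon as its confidence crosses a threshold. Everything below (the function name, the softmax pooling of per-frame scores) is an illustrative assumption, not the authors' implementation.

    import numpy as np

    def classify_online(frame_scores, threshold):
        """Fire an action label once the running posterior clears `threshold`.

        frame_scores: (num_frames, num_actions) array of per-frame scores,
            e.g. log-odds from a logistic regression over canonical poses.
        threshold: confidence required before committing; lower values fire
            earlier (less observational latency) but admit more errors.
        Returns (predicted_label, frames_observed).
        """
        num_actions = frame_scores.shape[1]
        cumulative = np.zeros(num_actions)
        probs = np.full(num_actions, 1.0 / num_actions)  # uniform prior
        for t, scores in enumerate(frame_scores, start=1):
            cumulative += scores                   # accumulate evidence
            probs = np.exp(cumulative - cumulative.max())
            probs /= probs.sum()                   # normalize (softmax)
            if probs.max() >= threshold:
                return int(probs.argmax()), t      # commit early
        return int(probs.argmax()), frame_scores.shape[0]

Sweeping the threshold from low to high traces out an accuracy/latency curve of the kind the experiments explore: low thresholds answer after only a few frames but stumble on ambiguous poses, while high thresholds wait for a distinctive canonical pose and answer late.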

Notes

  1. See Sect. 5 for more details on the data gathering process.

  2. The dataset has been made publicly available at http://www.cs.ucf.edu/~smasood/datasets/UCFKinect.zip.

  3. The optimal value of the threshold T was found for each value of γ using the training set (a toy version of this selection is sketched below).
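
As a rough illustration of note 3, the per-γ threshold selection can be pictured as a grid search over the training set. The cost below, classification errors plus a γ-weighted fraction of each sequence observed, is an assumed stand-in for the paper's actual latency-aware objective, and the helper reuses the hypothetical classify_online sketch from the abstract above.

    import numpy as np

    def select_threshold(train_seqs, train_labels, gamma,
                         candidates=np.linspace(0.50, 0.99, 50)):
        """Grid-search the firing threshold T for one value of gamma.

        train_seqs: list of (num_frames, num_actions) score arrays.
        gamma: assumed weight penalizing latency relative to errors.
        """
        best_T, best_cost = float(candidates[0]), float("inf")
        for T in candidates:
            errors, latency = 0, 0.0
            for scores, label in zip(train_seqs, train_labels):
                pred, frames = classify_online(scores, T)
                errors += int(pred != label)
                latency += frames / scores.shape[0]  # fraction observed
            cost = errors + gamma * latency          # assumed trade-off cost
            if cost < best_cost:
                best_T, best_cost = float(T), cost
        return best_T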

Acknowledgements

Marshall F. Tappen, Syed Z. Masood and Chris Ellis were supported by NSF grants IIS-0905387 and IIS-0916868. Joseph J. LaViola Jr. was supported by NSF CAREER award IIS-0845921 and NSF awards IIS-0856045 and CCF-1012056.

Author information

Correspondence to Syed Zain Masood.

S.Z. Masood and C. Ellis contributed equally to this paper.

Cite this article

Ellis, C., Masood, S.Z., Tappen, M.F. et al. Exploring the Trade-off Between Accuracy and Observational Latency in Action Recognition. Int J Comput Vis 101, 420–436 (2013). https://doi.org/10.1007/s11263-012-0550-7