Abstract
Recognizing and classifying human actions for the annotation of unconstrained video sequences has proven challenging because of variations in the environment, the appearance of the actors, the way in which different people perform the same action, the speed and duration of the action, and the point of view from which the event is observed. This variability makes it difficult to define effective descriptors and to derive appropriate codebooks for action categorization. In this chapter, we present a novel solution for classifying human actions in unconstrained videos. To build the codebook, we employ radius-based clustering with soft assignment, creating a rich vocabulary that can account for the high variability of human actions. We show that our solution achieves very good performance without parameter tuning. We also show that reducing the codebook size with Deep Belief Networks strongly reduces computation time with little loss of accuracy.
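As an illustration of the codebook idea mentioned in the abstract, the following is a minimal sketch of radius-based clustering followed by soft assignment. It is not the chapter's implementation: the greedy density criterion (count of neighbours within a fixed radius), the Gaussian weighting kernel, and the names `radius_codebook`, `soft_histogram`, `radius`, `max_words` and `sigma` are all assumptions made for this example.

```python
import numpy as np


def radius_codebook(features, radius, max_words):
    """Greedy radius-based clustering (illustrative assumption).

    Repeatedly pick the point with the most neighbours within `radius`
    as a codeword, then discard every point inside that radius.
    """
    remaining = np.asarray(features, dtype=float)
    words = []
    while len(remaining) > 0 and len(words) < max_words:
        # Pairwise squared distances among the points still unassigned.
        diff = remaining[:, None, :] - remaining[None, :, :]
        within = (diff ** 2).sum(axis=2) <= radius ** 2
        densest = within.sum(axis=1).argmax()
        words.append(remaining[densest])
        # Drop all points covered by the new codeword (including itself).
        remaining = remaining[~within[densest]]
    return np.array(words)


def soft_histogram(features, codebook, sigma):
    """Soft-assign each feature to all codewords with Gaussian weights."""
    features = np.asarray(features, dtype=float)
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    # Subtract the per-feature minimum so the nearest word has weight 1
    # before normalisation, which avoids division by zero on underflow.
    d2 = d2 - d2.min(axis=1, keepdims=True)
    weights = np.exp(-d2 / (2.0 * sigma ** 2))
    weights /= weights.sum(axis=1, keepdims=True)
    hist = weights.sum(axis=0)
    return hist / hist.sum()
```

Because the number of codewords follows from the radius rather than being fixed in advance, dense and sparse regions of descriptor space both receive codewords. With soft assignment, a descriptor that lies between two codewords contributes to both histogram bins instead of being forced into one.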
Notes
- 1. Please note that an earlier version of this work has recently appeared in IEEE Transactions on Multimedia [4].
References
Arulampalam M, Maskell S, Gordon N, Clapp T (2002) A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Trans Signal Process 50(2):174–188
Bagdanov AD, Dini F, Del Bimbo A, Nunziati W (2007) Improving the robustness of particle filter-based visual trackers using online parameter adaptation. In: Proc of AVSS
Ballan L, Bertini M, Del Bimbo A, Seidenari L, Serra G (2011) Event detection and recognition for semantic annotation of video. Multimed Tools Appl 51(1):279–302
Ballan L, Bertini M, Del Bimbo A, Seidenari L, Serra G (2012) Effective codebooks for human action representation and classification in unconstrained videos. IEEE Trans Multimed 14(4):1234–1245
Bernardin K, Stiefelhagen R (2008) Evaluating multiple object tracking performance: the CLEAR MOT metrics. EURASIP J Image Video Process 2008:246309
Bobick AF, Davis JW (2001) The recognition of human movement using temporal templates. IEEE Trans Pattern Anal Mach Intell 23(3):257–267
Bregonzio M, Gong S, Xiang T (2009) Recognising action as clouds of space-time interest points. In: Proc of CVPR
Cao L, Liu Z, Huang T (2010) Cross-dataset action detection. In: Proc of CVPR
Carreira-Perpiñán MA, Hinton GE (2005) On contrastive divergence learning. In: Proc of AISTATS
Chen MY, Hauptmann AG (2009) MoSIFT: recognizing human actions in surveillance videos. Technical report, CMU
Comaniciu D, Meer P (2002) Mean shift: a robust approach toward feature space analysis. IEEE Trans Pattern Anal Mach Intell 24(5):603–619
Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: Proc of CVPR
Dollár P, Rabaud V, Cottrell G, Belongie S (2005) Behavior recognition via sparse spatio-temporal features. In: Proc of VSPETS
Efros AA, Berg AC, Mori G, Malik J (2003) Recognizing action at a distance. In: Proc of ICCV
Fergus R, Perona P, Zisserman A (2003) Object class recognition by unsupervised scale-invariant learning. In: Proc of CVPR
Gao Z, Chen MY, Hauptmann AG, Cai A (2010) Comparing evaluation protocols on the KTH dataset. In: Proc of HBU workshop
Gorelick L, Blank M, Schechtman E, Irani M, Basri R (2007) Actions as space-time shapes. IEEE Trans Pattern Anal Mach Intell 29(12):2247–2253
Hauptmann AG, Christel MG, Yan R (2008) Video retrieval based on semantic concepts. Proc IEEE 96(4):602–622
Hinton GE, Salakhutdinov R (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507
Hinton GE, Osindero S, Teh Y (2006) A fast learning algorithm for deep belief nets. Neural Comput 18(7):1527–1554
Jiang YG, Yang J, Ngo CW, Hauptmann AG (2010) Representations of keypoint-based semantic concept detection: a comprehensive study. IEEE Trans Multimed 12(1):42–53
Jurie F, Triggs B (2005) Creating efficient codebooks for visual recognition. In: Proc of ICCV
Kläser A, Marszałek M, Schmid C (2008) A spatio-temporal descriptor based on 3D-gradients. In: Proc of BMVC
Kong Y, Zhang X, Hu W, Jia Y (2011) Adaptive learning codebook for action recognition. Pattern Recognit Lett 32(8):1178–1186
Kovashka A, Grauman K (2010) Learning a hierarchy of discriminative space-time neighborhood features for human action recognition. In: Proc of CVPR
Laptev I (2005) On space-time interest points. Int J Comput Vis 64(2–3):107–123
Laptev I, Marszałek M, Schmid C, Rozenfeld B (2008) Learning realistic human actions from movies. In: Proc of CVPR
Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: Proc of CVPR
Lin Z, Jiang Z, Davis LS (2009) Recognizing actions by shape-motion prototype trees. In: Proc of ICCV
Liu J, Shah M (2008) Learning human actions via information maximization. In: Proc of CVPR
Liu J, Ali S, Shah M (2008) Recognizing human actions using multiple features. In: Proc of CVPR
Liu J, Luo J, Shah M (2009) Recognizing realistic actions from videos “in the wild”. In: Proc of CVPR
Lucas BD, Kanade T (1981) An iterative image registration technique with an application to stereo vision. In: Proc of DARPA IU workshop
Marszałek M, Laptev I, Schmid C (2009) Actions in context. In: Proc of CVPR
Mikolajczyk K, Uemura H (2008) Action recognition with motion-appearance vocabulary forest. In: Proc of CVPR
Mikolajczyk K, Leibe B, Schiele B (2005) Local features for object class recognition. In: Proc of ICCV
Mikolajczyk K, Tuytelaars T, Schmid C, Zisserman A, Matas J, Schaffalitzky F, Kadir T, Van Gool L (2005) A comparison of affine region detectors. Int J Comput Vis 65(1/2):43–72
Moeslund T, Hilton A, Krüger V (2006) A survey of advances in vision-based human motion capture and analysis. Comput Vis Image Underst 104(2–3):90–126
Niebles JC, Wang H, Fei-Fei L (2008) Unsupervised learning of human action categories using spatial-temporal words. Int J Comput Vis 79(3):299–318
Poppe R (2007) Vision-based human motion analysis: an overview. Comput Vis Image Underst 108(1–2):4–18
Poppe R (2010) A survey on vision-based human action recognition. Image Vis Comput 28(6):976–990
Rapantzikos K, Avrithis Y, Kollias S (2009) Dense saliency-based spatiotemporal feature points for action recognition. In: Proc of CVPR
Schüldt C, Laptev I, Caputo B (2004) Recognizing human actions: a local SVM approach. In: Proc of ICPR
Scovanner P, Ali S, Shah M (2007) A 3-dimensional SIFT descriptor and its application to action recognition. In: Proc of ACM multimedia
Shao L, Mattivi R (2010) Feature detector and descriptor evaluation in human action recognition. In: Proc of CIVR
Shao L, Gao R, Liu Y, Zhang H (2011) Transform based spatio-temporal descriptors for human action recognition. Neurocomputing 74(6):962–973
Sivic J, Zisserman A (2003) Video Google: a text retrieval approach to object matching in videos. In: Proc of ICCV
Snoek CGM, Worring M, van Gemert JC, Geusebroek JM, Smeulders AWM (2006) The challenge problem for automated detection of 101 semantic concepts in multimedia. In: Proc of ACM multimedia
Sun X, Chen M, Hauptmann AG (2009) Action recognition via local descriptors and holistic features. In: Proc of CVPR4HB workshop
Turaga P, Chellappa R, Subrahmanian V, Udrea O (2008) Machine recognition of human activities: a survey. IEEE Trans Circuits Syst Video Technol 18(11):1473–1488
van der Maaten L, Postma E, van den Herik H (2009) Dimensionality reduction: a comparative review. Technical report TiCC-TR 2009-005, Tilburg University
van Gemert JC, Veenman CJ, Smeulders AWM, Geusebroek JM (2010) Visual word ambiguity. IEEE Trans Pattern Anal Mach Intell 32(7):1271–1283
Vezzani R, Cucchiara R (2010) Video surveillance online repository (ViSOR): an integrated framework. Multimed Tools Appl 50(2):359–380
Wang Y, Mori G (2009) Max-margin hidden conditional random fields for human action recognition. In: Proc of CVPR
Wang H, Ullah MM, Kläser A, Laptev I, Schmid C (2009) Evaluation of local spatio-temporal features for action recognition. In: Proc of BMVC
Willems G, Tuytelaars T, Van Gool L (2008) An efficient dense and scale-invariant spatio-temporal interest point detector. In: Proc of ECCV
Wong SF, Cipolla R (2007) Extracting spatiotemporal interest points using global information. In: Proc of ICCV
Wu B, Nevatia R (2007) Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet based part detectors. Int J Comput Vis 75(2):247–266
Yao A, Gall J, Van Gool L (2010) A Hough transform-based voting framework for action recognition. In: Proc of CVPR
Yilmaz A, Shah M (2005) Actions sketch: a novel action representation. In: Proc of CVPR
Yu G, Goussies N, Yuan J, Liu Z (2011) Fast action detection via discriminative random forest voting and top-k subvolume search. IEEE Trans Multimed 13(3):507–517
Zhang J, Marszałek M, Lazebnik S, Schmid C (2007) Local features and kernels for classification of texture and object categories: a comprehensive study. Int J Comput Vis 73(2):213–238
Copyright information
© 2013 Springer-Verlag London
About this chapter
Cite this chapter
Ballan, L., Seidenari, L., Serra, G., Bertini, M., Del Bimbo, A. (2013). Recognizing Human Actions by Using Effective Codebooks and Tracking. In: Farinella, G., Battiato, S., Cipolla, R. (eds) Advanced Topics in Computer Vision. Advances in Computer Vision and Pattern Recognition. Springer, London. https://doi.org/10.1007/978-1-4471-5520-1_3
DOI: https://doi.org/10.1007/978-1-4471-5520-1_3
Publisher Name: Springer, London
Print ISBN: 978-1-4471-5519-5
Online ISBN: 978-1-4471-5520-1
eBook Packages: Computer Science, Computer Science (R0)