Recognizing Human Actions by Using Effective Codebooks and Tracking

Ballan, Lamberto; Seidenari, Lorenzo; Serra, Giuseppe; Bertini, Marco; Del Bimbo, Alberto

doi:10.1007/978-1-4471-5520-1_3


Part of the book series: Advances in Computer Vision and Pattern Recognition (ACVPR)


Abstract

Recognizing and classifying human actions for the annotation of unconstrained video sequences has proven challenging because of variations in the environment, the appearance of the actors, the way different people perform the same action, the speed and duration of the action, and the point of view from which the event is observed. This variability makes it difficult to define effective descriptors and to derive appropriate codebooks for action categorization. In this chapter, we present a novel and effective solution for classifying human actions in unconstrained videos. To form the codebook, we employ radius-based clustering with soft assignment, which creates a rich vocabulary that can account for the high variability of human actions. We show that our solution achieves very good performance without parameter tuning. We also show that reducing the codebook size with Deep Belief Networks strongly reduces computation time with little loss of accuracy.
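
As a rough illustration of the codebook idea named in the abstract, the following Python sketch shows one simplified way to build a codebook by radius-based clustering and then encode a set of descriptors with soft assignment. It is not the chapter's implementation. The function names, the greedy choice of the densest neighbourhood (used here in place of a mean-shift density estimate), the Gaussian weighting kernel, and the parameters `radius`, `max_words`, and `sigma` are all assumptions made for illustration.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def radius_based_codebook(descriptors, radius, max_words):
    """Simplified radius-based clustering (illustrative assumption).

    Repeatedly picks the descriptor with the most neighbours within
    `radius`, stores the mean of that neighbourhood as a codeword, and
    discards every descriptor within `radius` of the new codeword.
    """
    remaining = [tuple(d) for d in descriptors]
    codebook = []
    r2 = radius * radius
    while remaining and len(codebook) < max_words:
        # Find the densest neighbourhood among the remaining descriptors.
        best = []
        for candidate in remaining:
            neighbours = [d for d in remaining if sq_dist(candidate, d) <= r2]
            if len(neighbours) > len(best):
                best = neighbours
        dim = len(best[0])
        centre = tuple(sum(d[i] for d in best) / len(best) for i in range(dim))
        codebook.append(centre)
        # Remove the descriptors covered by the new codeword.
        remaining = [d for d in remaining if sq_dist(centre, d) > r2]
    return codebook


def soft_assign_histogram(descriptors, codebook, sigma):
    """Encode descriptors as a normalized soft-assignment histogram.

    Each descriptor spreads one unit of mass over all codewords, with
    Gaussian weights that decay with distance (an assumed kernel).
    """
    hist = [0.0] * len(codebook)
    for d in descriptors:
        weights = [math.exp(-sq_dist(d, w) / (2.0 * sigma ** 2)) for w in codebook]
        total = sum(weights)
        if total == 0.0:
            continue  # descriptor is too far from every codeword to contribute
        for k, weight in enumerate(weights):
            hist[k] += weight / total
    mass = sum(hist)
    return [h / mass for h in hist] if mass > 0 else hist
```

Unlike k-means, which concentrates cluster centres in the densest regions of descriptor space, a radius-based scheme of this kind places at most one codeword per ball of the chosen radius, so sparse but potentially informative regions can still receive codewords. Soft assignment then avoids forcing a descriptor that lies between two codewords into a single bin.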


Notes

1. Please note that an earlier version of this work has recently appeared in IEEE Transactions on Multimedia [4].
2. http://lastlaugh.inf.cs.cmu.edu/libscom/downloads.htm
3. http://vision.ucsd.edu/%7epdollar/research.html
4. http://www.irisa.fr/vista/Equipe/People/Laptev/download.html

References

Arulampalam M, Maskell S, Gordon N, Clapp T (2002) A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Trans Signal Process 50(2):174–188
Bagdanov AD, Dini F, Del Bimbo A, Nunziati W (2007) Improving the robustness of particle filter-based visual trackers using online parameter adaptation. In: Proc of AVSS
Ballan L, Bertini M, Del Bimbo A, Seidenari L, Serra G (2011) Event detection and recognition for semantic annotation of video. Multimed Tools Appl 51(1):279–302
Ballan L, Bertini M, Del Bimbo A, Seidenari L, Serra G (2012) Effective codebooks for human action representation and classification in unconstrained videos. IEEE Trans Multimed 14(4):1234–1245
Bernardin K, Stiefelhagen R (2008) Evaluating multiple object tracking performance: the CLEAR MOT metrics. EURASIP J Image Video Process 2008:246309
Bobick AF, Davis JW (2001) The recognition of human movement using temporal templates. IEEE Trans Pattern Anal Mach Intell 23(3):257–267
Bregonzio M, Gong S, Xiang T (2009) Recognising action as clouds of space-time interest points. In: Proc of CVPR
Cao L, Liu Z, Huang T (2010) Cross-dataset action detection. In: Proc of CVPR
Carreira-Perpiñán MA, Hinton GE (2005) On contrastive divergence learning. In: Proc of AISTATS
Chen MY, Hauptmann AG (2009) MoSIFT: recognizing human actions in surveillance videos. Technical report, CMU
Comaniciu D, Meer P (2002) Mean shift: a robust approach toward feature space analysis. IEEE Trans Pattern Anal Mach Intell 24(5):603–619
Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: Proc of CVPR
Dollár P, Rabaud V, Cottrell G, Belongie S (2005) Behavior recognition via sparse spatio-temporal features. In: Proc of VSPETS
Efros AA, Berg AC, Mori G, Malik J (2003) Recognizing action at a distance. In: Proc of ICCV
Fergus R, Perona P, Zisserman A (2003) Object class recognition by unsupervised scale-invariant learning. In: Proc of CVPR
Gao Z, Chen MY, Hauptmann AG, Cai A (2010) Comparing evaluation protocols on the KTH dataset. In: Proc of HBU workshop
Gorelick L, Blank M, Schechtman E, Irani M, Basri R (2007) Actions as space-time shapes. IEEE Trans Pattern Anal Mach Intell 29(12):2247–2253
Hauptmann AG, Christel MG, Yan R (2008) Video retrieval based on semantic concepts. Proc IEEE 96(4):602–622
Hinton GE, Salakhutdinov R (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507
Hinton GE, Osindero S, Teh Y (2006) A fast learning algorithm for deep belief nets. Neural Comput 18(7):1527–1554
Jiang YG, Yang J, Ngo CW, Hauptmann AG (2010) Representations of keypoint-based semantic concept detection: a comprehensive study. IEEE Trans Multimed 12(1):42–53
Jurie F, Triggs B (2005) Creating efficient codebooks for visual recognition. In: Proc of ICCV
Kläser A, Marszałek M, Schmid C (2008) A spatio-temporal descriptor based on 3D-gradients. In: Proc of BMVC
Kong Y, Zhang X, Hu W, Jia Y (2011) Adaptive learning codebook for action recognition. Pattern Recognit Lett 32(8):1178–1186
Kovashka A, Grauman K (2010) Learning a hierarchy of discriminative space-time neighborhood features for human action recognition. In: Proc of CVPR
Laptev I (2005) On space-time interest points. Int J Comput Vis 64(2–3):107–123
Laptev I, Marszałek M, Schmid C, Rozenfeld B (2008) Learning realistic human actions from movies. In: Proc of CVPR
Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: Proc of CVPR
Lin Z, Jiang Z, Davis LS (2009) Recognizing actions by shape-motion prototype trees. In: Proc of ICCV
Liu J, Shah M (2008) Learning human actions via information maximization. In: Proc of CVPR
Liu J, Ali S, Shah M (2008) Recognizing human actions using multiple features. In: Proc of CVPR
Liu J, Luo J, Shah M (2009) Recognizing realistic actions from videos “in the wild”. In: Proc of CVPR
Lucas BD, Kanade T (1981) An iterative image registration technique with an application to stereo vision. In: Proc of DARPA IU workshop
Marszałek M, Laptev I, Schmid C (2009) Actions in context. In: Proc of CVPR
Mikolajczyk K, Uemura H (2008) Action recognition with motion-appearance vocabulary forest. In: Proc of CVPR
Mikolajczyk K, Leibe B, Schiele B (2005) Local features for object class recognition. In: Proc of ICCV
Mikolajczyk K, Tuytelaars T, Schmid C, Zisserman A, Matas J, Schaffalitzky F, Kadir T, Van Gool L (2005) A comparison of affine region detectors. Int J Comput Vis 65(1/2):43–72
Moeslund T, Hilton A, Krüger V (2006) A survey of advances in vision-based human motion capture and analysis. Comput Vis Image Underst 104(2–3):90–126
Niebles JC, Wang H, Fei-Fei L (2008) Unsupervised learning of human action categories using spatial-temporal words. Int J Comput Vis 79(3):299–318
Poppe R (2007) Vision-based human motion analysis: an overview. Comput Vis Image Underst 108(1–2):4–18
Poppe R (2010) A survey on vision-based human action recognition. Image Vis Comput 28(6):976–990
Rapantzikos K, Avrithis Y, Kollias S (2009) Dense saliency-based spatiotemporal feature points for action recognition. In: Proc of CVPR
Schüldt C, Laptev I, Caputo B (2004) Recognizing human actions: a local SVM approach. In: Proc of ICPR
Scovanner P, Ali S, Shah M (2007) A 3-dimensional SIFT descriptor and its application to action recognition. In: Proc of ACM multimedia
Shao L, Mattivi R (2010) Feature detector and descriptor evaluation in human action recognition. In: Proc of CIVR
Shao L, Gao R, Liu Y, Zhang H (2011) Transform based spatio-temporal descriptors for human action recognition. Neurocomputing 74(6):962–973
Sivic J, Zisserman A (2003) Video Google: a text retrieval approach to object matching in videos. In: Proc of ICCV
Snoek CGM, Worring M, van Gemert JC, Geusebroek JM, Smeulders AWM (2006) The challenge problem for automated detection of 101 semantic concepts in multimedia. In: Proc of ACM multimedia
Sun X, Chen M, Hauptmann AG (2009) Action recognition via local descriptors and holistic features. In: Proc of CVPR4HB workshop
Turaga P, Chellappa R, Subrahmanian V, Udrea O (2008) Machine recognition of human activities: a survey. IEEE Trans Circuits Syst Video Technol 18(11):1473–1488
van der Maaten L, Postma E, van den Herik H (2009) Dimensionality reduction: a comparative review. Technical report TiCC-TR 2009-005, Tilburg University
van Gemert JC, Veenman CJ, Smeulders AWM, Geusebroek JM (2010) Visual word ambiguity. IEEE Trans Pattern Anal Mach Intell 32(7):1271–1283
Vezzani R, Cucchiara R (2010) Video surveillance online repository (ViSOR): an integrated framework. Multimed Tools Appl 50(2):359–380
Wang Y, Mori G (2009) Max-margin hidden conditional random fields for human action recognition. In: Proc of CVPR
Wang H, Ullah MM, Kläser A, Laptev I, Schmid C (2009) Evaluation of local spatio-temporal features for action recognition. In: Proc of BMVC
Willems G, Tuytelaars T, Van Gool L (2008) An efficient dense and scale-invariant spatio-temporal interest point detector. In: Proc of ECCV
Wong SF, Cipolla R (2007) Extracting spatiotemporal interest points using global information. In: Proc of ICCV
Wu B, Nevatia R (2007) Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet based part detectors. Int J Comput Vis 75(2):247–266
Yao A, Gall J, Van Gool L (2010) A Hough transform-based voting framework for action recognition. In: Proc of CVPR
Yilmaz A, Shah M (2005) Actions sketch: a novel action representation. In: Proc of CVPR
Yu G, Goussies N, Yuan J, Liu Z (2011) Fast action detection via discriminative random forest voting and top-k subvolume search. IEEE Trans Multimed 13(3):507–517
Zhang J, Marszałek M, Lazebnik S, Schmid C (2007) Local features and kernels for classification of texture and object categories: a comprehensive study. Int J Comput Vis 73(2):213–238


Author information

Authors and Affiliations

Media Integration and Communication Center, University of Florence, Viale Morgagni 65, 50134, Florence, Italy
Lamberto Ballan, Lorenzo Seidenari, Giuseppe Serra, Marco Bertini & Alberto Del Bimbo


Corresponding author

Correspondence to Lamberto Ballan.

Editor information

Editors and Affiliations

Dipartimento di Matematica e Informatica, Università di Catania, Catania, Italy
Giovanni Maria Farinella
Dipartimento di Matematica e Informatica, Università di Catania, Catania, Italy
Sebastiano Battiato
Department of Engineering, University of Cambridge, Cambridge, United Kingdom
Roberto Cipolla


About this chapter

Cite this chapter

Ballan, L., Seidenari, L., Serra, G., Bertini, M., Del Bimbo, A. (2013). Recognizing Human Actions by Using Effective Codebooks and Tracking. In: Farinella, G., Battiato, S., Cipolla, R. (eds) Advanced Topics in Computer Vision. Advances in Computer Vision and Pattern Recognition. Springer, London. https://doi.org/10.1007/978-1-4471-5520-1_3


DOI: https://doi.org/10.1007/978-1-4471-5520-1_3
Publisher Name: Springer, London
Print ISBN: 978-1-4471-5519-5
Online ISBN: 978-1-4471-5520-1
eBook Packages: Computer Science, Computer Science (R0)
