Abstract
Recognizing human actions from a single viewpoint is hampered by self-occlusions and by occlusions from other objects. Incorporating multiple cameras can help overcome these issues. However, the question remains how to use the information from all viewpoints efficiently to increase performance. Researchers have reconstructed 3D models from multiple views to reduce the dependency on viewpoint, but this 3D approach is often computationally expensive. Moreover, the quality of each view influences the overall model, and the reconstruction is limited to the volume where the views overlap. In this paper, we propose a novel method to efficiently combine 2D data from different viewpoints. Spatio-temporal features are extracted from each viewpoint and then used in a bag-of-words framework to form histograms. Two codebook sizes are used. The similarity between the resulting histograms is measured with the Histogram Intersection kernel as well as the RBF kernel with \(\chi ^2\) distance. Lastly, we combine all the basic kernels generated by the different choices of viewpoint, feature type, codebook size and kernel type. The final kernel is a linear combination of the basic kernels, with weights obtained from an optimization process. For higher accuracy, a separate set of kernel weights is computed for each binary SVM classifier. Our method not only combines the information from multiple viewpoints efficiently, but also improves performance by mapping features into various kernel spaces. The efficiency of the proposed method is demonstrated on two commonly used multi-view human action datasets. Moreover, several experiments indicate the contribution of each part of the method to the overall performance.
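The kernel construction summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy histograms, the `gamma` parameter, the specific \(\chi ^2\) form (without a factor of 1/2) and the uniform kernel weights are all assumptions made for illustration. In the method described, the weights come from an optimization process and are computed separately for each binary SVM classifier.

```python
import numpy as np


def histogram_intersection_kernel(A, B):
    """K[i, j] = sum_k min(A[i, k], B[j, k]) for row-wise histograms."""
    return np.minimum(A[:, None, :], B[None, :, :]).sum(axis=2)


def chi2_rbf_kernel(A, B, gamma=1.0, eps=1e-12):
    """K[i, j] = exp(-gamma * chi2(A[i], B[j])).

    Assumed chi2 form: sum_k (a_k - b_k)^2 / (a_k + b_k).
    eps avoids division by zero for bins that are empty in both histograms.
    """
    diff = A[:, None, :] - B[None, :, :]
    total = A[:, None, :] + B[None, :, :]
    chi2 = (diff ** 2 / (total + eps)).sum(axis=2)
    return np.exp(-gamma * chi2)


def combine_kernels(kernels, weights):
    """Linear combination of basic kernels with non-negative weights."""
    weights = np.asarray(weights, dtype=float)
    if len(kernels) != len(weights):
        raise ValueError("need exactly one weight per basic kernel")
    if np.any(weights < 0):
        raise ValueError("kernel weights must be non-negative")
    return sum(w * K for w, K in zip(weights, kernels))


# Toy data (hypothetical): 5 videos, 2 viewpoints, codebook of 8 visual words.
rng = np.random.default_rng(0)


def toy_histograms(n_videos, codebook_size):
    counts = rng.integers(1, 20, size=(n_videos, codebook_size)).astype(float)
    return counts / counts.sum(axis=1, keepdims=True)  # L1-normalize each row


views = [toy_histograms(5, 8), toy_histograms(5, 8)]

# One basic kernel per (viewpoint, kernel type) pair.
basic_kernels = []
for H in views:
    basic_kernels.append(histogram_intersection_kernel(H, H))
    basic_kernels.append(chi2_rbf_kernel(H, H))

# Placeholder uniform weights; a learned weighting would replace these.
weights = np.full(len(basic_kernels), 1.0 / len(basic_kernels))
K_final = combine_kernels(basic_kernels, weights)
```

A combined Gram matrix such as `K_final` could then be passed to an SVM that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`, with a different weight vector per binary classifier.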













Notes
Note that \(\alpha _i\ne 0\) only for support vectors.
The dataset is accessible via http://4drepository.inrialpes.fr/public/viewgroup/6.
The actions are standing still, clapping, waving one arm, waving two arms, punching, jogging, jumping jack, kicking, bending and bowling.
Cite this article
Saghafi, B., Rajan, D. & Li, W. Efficient 2D viewpoint combination for human action recognition. Pattern Anal Applic 19, 563–577 (2016). https://doi.org/10.1007/s10044-016-0537-z
DOI: https://doi.org/10.1007/s10044-016-0537-z