Abstract
This paper proposes a method for categorizing human actions with high dynamics in the upper extremities. It combines generative and discriminative approaches: possible arm pose candidates are inferred from images and their action categories are validated, while the validated category in turn facilitates deriving the estimated arm poses. The method exploits the complementary relationship between action categorization and arm pose modeling by adopting the arm pose prior of a hypothetical action category to enhance the modeling of possible arm poses, and then applying features captured from the temporal and spatial action characteristics of the arm pose candidates to improve categorization. From a given visual observation, arm pose states are estimated on a graphical model via dynamic programming under an action category hypothesis, which is then validated by a trained discriminative model based on temporal arm pose words from the estimated arm pose candidates. The method has been evaluated on videos of four action types from the Berkeley multimodal human action dataset, achieving categorization success rates of 91.47% and 95.83% for single and multiple frames, respectively, and on images of three action types from the HumanEva-I dataset, achieving a categorization success rate of 96.67%. Its arm pose modeling performance also improves for actions with high dynamics in the upper extremities.
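To make the hypothesize-and-validate scheme in the abstract concrete, the following is a minimal sketch (not the authors' implementation) of dynamic programming over a chain graphical model of per-frame arm pose candidates: for each hypothesized action category, a Viterbi pass finds the best-scoring pose sequence under that category's pose prior, and the hypothesis with the highest sequence score is selected. The function names, the log-score matrices, and the chain structure are illustrative assumptions; the paper's actual model, features, and discriminative validation stage are more elaborate.

```python
import numpy as np

def viterbi_pose_sequence(unary, pairwise):
    """Best pose-candidate sequence on a chain model via dynamic programming.

    unary:    (T, K) log-scores of K pose candidates in each of T frames,
              e.g. under an action-specific arm pose prior (placeholder).
    pairwise: (K, K) log-transition scores between candidates of
              consecutive frames (placeholder for temporal smoothness).
    Returns the total log-score of the best sequence and the sequence itself.
    """
    T, K = unary.shape
    score = unary[0].copy()            # best score ending at each candidate
    back = np.zeros((T, K), dtype=int) # backpointers for path recovery
    for t in range(1, T):
        total = score[:, None] + pairwise          # (prev K, next K)
        back[t] = np.argmax(total, axis=0)         # best predecessor per state
        score = total[back[t], np.arange(K)] + unary[t]
    # Trace the best path backwards from the final frame.
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return float(np.max(score)), path[::-1]

def categorize(unaries_by_action, pairwise_by_action):
    """Score each action hypothesis by its best pose sequence; pick the max."""
    scores = {a: viterbi_pose_sequence(u, pairwise_by_action[a])[0]
              for a, u in unaries_by_action.items()}
    return max(scores, key=scores.get), scores
```

In the paper the winning hypothesis is additionally validated by a trained discriminative model over temporal arm pose words; here the raw sequence score stands in for that validation step.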
Li, C., Yung, N.H.C. Categorization of human actions with high dynamics in upper extremities based on arm pose modeling. Machine Vision and Applications 26, 619–632 (2015). https://doi.org/10.1007/s00138-015-0686-x