Robust human action recognition scheme based on high-level feature fusion

Multimedia Tools and Applications

Abstract

This paper presents our research on human action recognition using several low-level local and spatio-temporal descriptors. The motivation is that these descriptors emphasize different aspects of actions. We investigate a generic approach that handles both periodic and non-periodic actions within a single framework, evaluated on the Weizmann and KTH datasets. To this end, we explore the notion of a self-similarity descriptor over time. Non-linear χ² kernel-based Support Vector Machines then perform classification, with each action modeled independently. Finally, the classifier outputs are fused using our proposed neural network based on evidence theory, which aims to improve the classification rate by pushing the classifiers into an optimized structure. Experimental results show the efficiency of the proposed scheme and a significant improvement in the classification rate.
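
To illustrate the χ² kernel mentioned in the abstract, here is a minimal Python sketch of one common exponential form, k(x, y) = exp(−γ Σ_i (x_i − y_i)² / (x_i + y_i)), for non-negative histograms. The exact normalization, the value of γ, and the function names are assumptions; the abstract does not specify which variant the paper uses.

```python
import math


def chi2_distance(x, y):
    """Chi-square distance between two non-negative histograms (assumed form)."""
    if len(x) != len(y):
        raise ValueError("histograms must have the same length")
    total = 0.0
    for a, b in zip(x, y):
        s = a + b
        if s > 0:  # skip bins that are empty in both histograms
            total += (a - b) ** 2 / s
    return total


def chi2_kernel(x, y, gamma=1.0):
    """Exponential chi-square kernel; gamma is a hypothetical bandwidth parameter."""
    return math.exp(-gamma * chi2_distance(x, y))


if __name__ == "__main__":
    h1 = [0.2, 0.5, 0.3]
    h2 = [0.1, 0.6, 0.3]
    print(chi2_kernel(h1, h1))  # identical histograms give the maximum value 1.0
    print(chi2_kernel(h1, h2))  # differing histograms give a value below 1.0
```

A kernel matrix built from such a function could in principle be passed to any SVM implementation that accepts precomputed kernels.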

Notes

  1. Artts: Action Recognition and Tracking based on Time-of-flight Sensors. Website: http://www.artts.eu/

  2. Co-friend: Cognitive & Flexible learning system operating Robust Interpretation of ExteNded real scenes by multi-sensor Datafusion. Website: http://easaier.silogic.fr/co-friend/

  3. Quaero: “I seek” in Latin. Website: http://quaero.org/modules/movie/scenes/home/

  4. Scovis: Self-Configurable Cognitive Video Supervision. Website: http://grid.ece.ntua.gr/?projects=scovis/

  5. http://opencv.willowgarage.com/wiki/

  6. Other clustering techniques can be used. Spectral clustering has recently emerged as a popular clustering method that uses the eigenvectors of a matrix derived from the data. Several algorithms have been proposed in the literature [4, 43], each using the eigenvectors in slightly different ways. In this paper, we use the sophisticated K-means implementation provided by the latest version of the INRIA-Yael library (website: https://gforge.inria.fr/projects/yael) because of its simplicity and efficiency. The library provides efficient tools for basic, computationally demanding tasks.

  7. Other challenging collections are available, such as TRECVid (http://trecvid.nist.gov/), Hollywood2 (http://www.irisa.fr/vista/actions/hollywood2/), and Scovis (http://www.scovis.eu/). To our knowledge, the last is a novel dataset in the community in that it contains video sequences from the production line of a major automobile manufacturer. The footage is captured by a static camera from oblique views in a totally uncontrolled industrial environment [20, 70]. It exhibits sparks and vibrations, camera oscillations, upright racks and heavy occlusions of workers over most of the image, a cluttered background, welding machines, forklifts, etc. These conditions make the dataset more complex and more difficult to process. Applying our approach to it would require several additional preprocessing steps, such as background filtering, trajectory filtering, and object detection with geometrical filtering.

  8. The balanced error rate (BER) is the average of the error rates on each class. BER is used in the “Performance Prediction Challenge Workshop”.

  9. http://www.intel.com/technology/computing/opencv/

  10. http://www.irisa.fr/vista/Equipe/People/Laptev/download.html
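
Note 8 defines the balanced error rate (BER) as the average of the per-class errors. A minimal Python sketch of that definition follows; the function name and the toy labels are illustrative assumptions, not taken from the paper.

```python
def balanced_error_rate(y_true, y_pred):
    """Average of the per-class error rates, computed over the classes in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    counts = {}  # number of samples per true class
    errors = {}  # number of misclassified samples per true class
    for t, p in zip(y_true, y_pred):
        counts[t] = counts.get(t, 0) + 1
        if t != p:
            errors[t] = errors.get(t, 0) + 1
    per_class = [errors.get(c, 0) / n for c, n in counts.items()]
    return sum(per_class) / len(per_class)


if __name__ == "__main__":
    # Hypothetical labels: 1 of 4 "walk" samples and 1 of 2 "run" samples are wrong.
    truth = ["walk", "walk", "walk", "walk", "run", "run"]
    preds = ["walk", "walk", "walk", "run", "walk", "run"]
    print(balanced_error_rate(truth, preds))  # (0.25 + 0.5) / 2 = 0.375
```

Unlike the plain error rate (2/6 in this toy case), BER weights every class equally, so a small class with many mistakes is not hidden by a large, well-classified one.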

References

  1. Aggarwal J, Cai Q (1997) Human motion analysis: a review. In: Proceedings of IEEE nonrigid and articulated motion workshop, pp 90–102

  2. Aggarwal J, Park S (2004) Human motion: modeling and recognition of actions and interactions. In: 3D data processing visualization and transmission, pp 640–647

  3. Ahmad M, Lee S-W (2006) HMM-based human action recognition using multi-view image sequences. In: International conference on pattern recognition, pp 263–266

  4. Bach FR, Jordan MI (2003) Learning spectral clustering. Adv Neural Inf Process Syst, 305–312

  5. Benmokhtar R, Huet B (2006) Classifier fusion: combination methods for semantic indexing in video content. In: Proceedings of international conference on artificial neural networks, pp 65–74

  6. Benmokhtar R, Huet B (2007) Neural network combining classifier based on Dempster–Shafer theory for semantic indexing in video content. In: International multimedia modeling conference, pp 196–205

  7. Bobick A-F, Davis J-W (2001) The recognition of human movement using temporal templates. IEEE Trans Pattern Anal Mach Intell 23:257–267

  8. Bregonzio M, Gong S, Xiang T (2009) Recognising action as clouds of space-time interest points. In: IEEE computer vision and pattern recognition

  9. Castellanos R, Kalva H, Marques O, Furht B (2010) Event detection in video using motion analysis. In: ACM international workshop on analysis and retrieval of tracked events and motion in imagery streams, pp 57–62

  10. Chapelle O, Haffner P, Vapnik V (1999) Support vector machines for histogram-based image classification. IEEE Trans Neural Netw 10(5):1055–1064

  11. Chaudhry R, Ravichandran A, Hager G, Vidal R (2009) Histograms of oriented optical flow and Binet–Cauchy kernels on nonlinear dynamical systems for the recognition of human actions. In: IEEE conference on computer vision and pattern recognition, pp 1932–1939

  12. Chua T-W, Nam K-L, Pham T (2011) Human action recognition via sum-rule fusion of fuzzy K-nearest neighbor classifiers. In: IEEE international conference on fuzzy systems, pp 484–489

  13. Cilla R, Patricio M, Berlanga A, Molina J (2010) Fusion of single view soft k-NN classifiers for multi-camera human action recognition. Hybrid Artif Intell Syst 6077:436–443

  14. Dempster A (1967) Upper and lower probabilities induced by multivalued mapping. Ann Math Stat AMS 38:325–339

  15. Denoeux T (2000) A neural network classifier based on Dempster–Shafer theory. IEEE Trans Syst Man Cybern 30(2):131–150

  16. Denoeux T, Kennes G (1996) Combined supervised and unsupervised learning for system diagnostic using Dempster–Shafer theory. In: Multi conference computational engineering applications. Symposium on control, optimization and supervision, vol 1, pp 104–109

  17. Dollar P, Rabaud V, Cottrell G, Belongie S (2005) Behavior recognition via sparse spatio-temporal features. In: IEEE international workshop on visual surveillance and performance evaluation of tracking and surveillance, pp 65–72

  18. Doulamis N, Doulamis A (2006) Evaluation of relevance feedback schemes in content-based in retrieval systems. Signal Process Image Commun 21:334–357

  19. Doulamis A, Doulamis N, Akrivas G, Kollias S (2000) Non-sequential video content representation using temporal variation of feature vectors. IEEE Trans Consum Electron 46(3):758–768

  20. Doulamis N-D, Voulodimos A-S, Kosmopoulos D-I, Varvarigou T-A (2010) Enhanced human behavior recognition using HMM and evaluative rectification. In: ACM international workshop on analysis and retrieval of tracked events and motion in imagery streams, pp 39–44

  21. Duchenne O, Laptev I, Sivic J, Bach F, Ponce J (2009) Automatic annotation of human actions in video. In: IEEE international conference on computer vision, pp 1491–1498

  22. Efros A-A, Berg A-C, Mori G, Malik J (2003) Recognizing action at a distance. In: IEEE international conference on computer vision, vol 2, pp 726–733

  23. Fablet R, Bouthemy P (2003) Motion recognition using non-parametric image motion models estimated from temporal and multiscale co-occurrence statistics. IEEE Trans Pattern Anal Mach Intell 25:1619–1624

  24. Fei-Fei L, Perona P (2005) A Bayesian hierarchical model for learning natural scene categories. In: IEEE conference on computer vision, pp 524–531

  25. Garcia E (2006) Cosine similarity and term weight tutorial. In: Information retrieval intelligence

  26. Gorelick L, Blank M, Shechtman E, Irani M, Basri R (2007) Actions as space-time shapes. IEEE Trans Pattern Anal Mach Intell 29(12):2247–2253

  27. Horn BKP, Schunck BG (1981) Determining optical flow. Artif Intell 17:185–203

  28. Hu W, Tan T, Wang L, Maybank S (2004) A survey on visual surveillance of object motion and behaviors. IEEE Trans Syst Man Cybern, Part C Appl Rev 34(3):334–352

  29. Jacobs R, Jordan M, Nowlan S, Hinton G (1991) Adaptive mixtures of local experts. Neural Comput 3:1409–1431

  30. Jia K, Yeung D-Y (2008) Human action recognition using local spatio-temporal discriminant embedding. In: IEEE conference on computer vision & pattern recognition, pp 1–8

  31. Jianguo Z, Marszalek M, Lazebnik S, Schmid C (2006) Local features and kernels for classification of texture and object categories: a comprehensive study. In: Proceedings of the conference on computer vision and pattern recognition workshop, vol 10, no 5, p 13

  32. Junejo I-N, Dexter E, Laptev I, Pérez P (2008) Cross-view action recognition from temporal self-similarities. In: Proceedings of the European conference on computer vision, pp 293–306

  33. Junejo I, Dexter E, Laptev I, Pérez P (2011) View-independent action recognition from temporal self-similarities. IEEE Trans Pattern Anal Mach Intell 33(1):172–185

  34. Jurie F, Triggs B (2005) Creating efficient codebooks for visual recognition. In: IEEE conference on computer vision, vol 1, pp 604–610

  35. Kadir T, Brady M (2001) Saliency, scale and image description. Int J Comput Vis 2:83–105

  36. Krausz B, Herpers R (2010) MetroSurv: detecting events in subway stations. Multimedia Tools Appl 50:123–147

  37. Kuncheva LI (2001) Using measures of similarity and inclusion for multiple classifier fusion by decision templates. Fuzzy Sets Syst 122:401–407

  38. Laptev I (2005) On space-time interest points. Int J Comput Vis 64(2/3):107–123

  39. Laptev I, Marszalek M, Schmid C, Rozenfeld B (2008) Learning realistic human actions from movies. In: IEEE conference on computer vision & pattern recognition, vol 3, pp 1–8

  40. Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: IEEE conference on computer vision and pattern recognition, vol 2, pp 2169–2178

  41. Liu J, Ali S, Shah M (2008) Recognizing human actions using multiple features. In: International conference on computer vision, pp 1–8

  42. Lucas B-D, Kanade T (1981) An iterative image registration technique with an application to stereo vision. In: Proceedings of the international joint conference on artificial intelligence, pp 674–679

  43. Luxburg U (2007) A tutorial on spectral clustering. Stat Comput 17:395–416

  44. Mahbub U, Imtiaz H, Ahad A-R (2011) An optical flow-based action recognition algorithm. In: IEEE conference on computer vision and pattern recognition

  45. Marzat J, Dumortier Y, Ducrot A (2008) Real-time dense and accurate parallel optical flow using CUDA. In: International workshop on computer vision and its application to image media processing

  46. Matikainen P, Hebert M, Sukthankar R (2009) Trajectons: action recognition through the motion analysis of tracked features. In: Proceedings of the workshop on video-oriented object and event classification

  47. Messing R, Pal C, Kautz H (2009) Activity recognition using the velocity histories of tracked keypoints. In: International conference on computer vision

  48. Mikolajczyk K, Uemura H (2008) Action recognition with motion-appearance vocabulary forest. In: International conference on computer vision, pp 1–8

  49. Murphy K-O (2002) Dynamic Bayesian networks: representation, inference and learning. PhD thesis, UC Berkeley, Computer Science Division

  50. Niebles J-C, Wang H, Fei-Fei L (2006) Unsupervised learning of human action categories using spatial-temporal words. In: British machine vision conference

  51. Noguchi A, Yanai K (2009) A SURF-based spatio-temporal feature for feature-fusion-based action recognition. In: Proceedings on human motion: understanding, modeling, capture and animation

  52. Ntalianis K-S, Doulamis A-D, Tsapatsoulis N, Doulamis N (2010) Human action annotation, modeling and analysis based on implicit user interaction. Multimedia Tools Appl 50:199–225

  53. Oikonomopoulos A, Patras I, Pantic M (2006) Spatiotemporal salient points for visual recognition of human actions. IEEE Trans Syst Man Cybern 3:710–719

  54. Perronnin F, Sánchez J, Mensink T (2010) Improving the Fisher kernel for large-scale image classification. In: European conference on computer vision, pp 143–156

  55. Pundlik S-J, Birchfield S-T (2008) Real-time motion segmentation of sparse feature points at any speed. IEEE Trans Syst Man Cybern, Part B Cybern 38(3):731–742

  56. Ramasso E, Pellerin D, Panagiotakis C, Rombaut M, Tziritas G, Lim W (2005) Spatio-temporal information fusion for human action recognition in videos. In: European signal processing conference

  57. Roth D, Koller-Meier E, Van Gool J-L (2010) Multi-object tracking evaluated on sparse events. Multimedia Tools Appl 50:29–47

  58. Scholkopf B, Smola A (2002) Learning with kernels: support vector machines, regularization, optimization and beyond. MIT Press, Cambridge

  59. Schuldt C, Laptev I, Caputo B (2004) Recognizing human actions: a local SVM approach. In: Proceedings of the international conference on pattern recognition, vol 3, pp 32–36

  60. Seidenari L, Bertini M, Del Bimbo A (2010) Dense spatio-temporal features for non-parametric anomaly detection and localization. In: ACM international workshop on analysis and retrieval of tracked events and motion in imagery streams, pp 27–32

  61. Shafer G (1976) A mathematical theory of evidence. Princeton University Press, Princeton

  62. Shechtman E, Irani M (2007) Matching local self-similarities across images and videos. In: IEEE conference on computer vision and pattern recognition, pp 1–8

  63. Shi J, Tomasi C (1994) Good features to track. In: IEEE conference on computer vision and pattern recognition, pp 593–600

  64. Simon C, Meessen J, De Vleeschouwer C (2010) Visual event recognition using decision trees. Multimedia Tools Appl 50:95–121

  65. Smets P, Kennes R (1994) The transferable belief model. Artif Intell 66(2):191–234

  66. Stokman H, Gevers T (2007) Selection and fusion of color models for image feature detection. IEEE Trans Pattern Anal Mach Intell 29:371–381

  67. Sun X, Chen M, Hauptman A (2009) Action recognition via local descriptors and holistic features. In: International conference on computer vision

  68. Sun B-Y, Zhang X-M, Li J, Mao X-M (2010) Feature fusion using locally linear embedding for classification. IEEE Trans Neural Netw 21:163–168

  69. Sun J, Wu X, Yan S, Cheong L-F, Chua T-S, Li J (2011) Action recognition by dense trajectories. In: IEEE computer vision and pattern recognition

  70. Voulodimos A, Kosmopoulos D, Vasileiou G, Sardis E, Doulamis A, Anagnostopoulos V, Lalos C, Varvarigou T (2011) A dataset for workflow recognition in industrial scenes. Proc Int Conf Image Proc 30(2):3249–3252

  71. Waltisberg D, Yao A, Gall J, Van-Gool L (2010) Variations of a Hough-voting action recognition system. In: Proceedings of the ICPR contests, pp 306–312

  72. Wang J-J, Singh S (2003) Video analysis of human dynamics—a survey. Real-Time Imag 9(3):321–346

  73. Wang L, Zhou H, Low S-C, Leckie C (2009) Action recognition via multi-feature fusion and Gaussian process classification. In: Workshop on applications of computer vision, pp 1–6

  74. Wang H, Ullah M-M, Kläser A, Laptev I, Schmid C (2009) Evaluation of local spatio-temporal features for action recognition. In: British machine vision conference, p 127

  75. Wang H, Klaser A, Schmid C, Liu C-L (2009) Hierarchical spatio-temporal context modeling for action recognition. In: IEEE computer vision and pattern recognition

  76. Weinland D, Ronfard R, Boyer E (2006) Free viewpoint action recognition using motion history volumes. Special issue on Modeling People: Vision-Based Understanding of a Person’s Shape, Appearance, Movement and Behaviour. Comput Vis Image Underst 104(2–3):249–257

  77. Willems G, Tuytelaars T, Gool L-V (2008) An efficient dense and scale-invariant spatio-temporal interest point detector. In: International conference on computer vision

  78. Wong S-F, Cipolla R (2007) Extracting spatio-temporal interest points using global information. In: IEEE conference on computer vision and pattern recognition, pp 1–8

  79. Xu L, Krzyzak A, Suen C (1992) Methods of combining multiple classifiers and their application to handwriting recognition. IEEE Trans Syst Man Cybern 22:418–435

  80. Yang J, Yu K, Gong Y, Huang T (2009) Linear spatial pyramid matching using sparse coding for image classification. In: IEEE conference on computer vision and pattern recognition, pp 1794–1801

  81. Zouhal L, Denoeux T (1995) An adaptive k-NN rule based on Dempster–Shafer theory. In: International conference on computer analysis of images and patterns, vol 4351, pp 310–317

Author information

Corresponding author

Correspondence to Rachid Benmokhtar.

About this article

Cite this article

Benmokhtar, R. Robust human action recognition scheme based on high-level feature fusion. Multimed Tools Appl 69, 253–275 (2014). https://doi.org/10.1007/s11042-012-1022-3

  • DOI: https://doi.org/10.1007/s11042-012-1022-3
