Abstract
In recent years, human action recognition systems have been increasingly developed to support a wide range of applications, such as surveillance, behaviour analysis, and security. In particular, data fusion approaches that combine depth and colour information (i.e., RGB-D data) appear especially promising for recognizing large classes of human actions with a high level of accuracy. However, existing data fusion approaches are mainly based on feature fusion strategies, which tend to suffer from some limitations, including the difficulty of combining different feature types and the management of missing information. To address these two problems, we propose an RGB-D based human action recognition system supported by a decision fusion strategy. The system, starting from the well-known Joint Directors of Laboratories (JDL) data fusion model, analyses human actions separately on each channel (i.e., depth and colour). The actions are modelled as sums of visual words by using the traditional Bag-of-Visual-Words (BoVW) model. Subsequently, on each channel, these actions are classified by a multi-class Support Vector Machine (SVM) classifier. Finally, the per-channel classification results are fused by a Naive Bayes Combination (NBC) method. The effectiveness of the proposed system has been evaluated on three public datasets: UTKinect-Action3D, CAD-60, and LIRIS Human Activities. Experimental results, compared with key works of the current state-of-the-art, show that the proposed approach is a concrete contribution to the action recognition field.
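The per-channel BoVW representation and the NBC decision fusion described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the functions `bovw_histogram` and `nbc_fuse`, the toy confusion matrices, and the uniform priors are all hypothetical. The fusion step follows the standard Naive Bayes Combination formulation (Kuncheva, 2004), in which per-channel class-conditional probabilities are estimated from confusion matrices computed on a validation set.

```python
import numpy as np

def bovw_histogram(descriptors, vocabulary):
    """Quantise local descriptors against a visual vocabulary
    (e.g. k-means centroids) and return an L1-normalised
    bag-of-visual-words histogram."""
    # nearest visual word for each descriptor (Euclidean distance)
    dists = np.linalg.norm(
        descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / max(hist.sum(), 1.0)

def nbc_fuse(conf_mats, preds, priors):
    """Naive Bayes Combination: fuse the hard decisions of several
    classifiers using their validation confusion matrices.
    conf_mats[k][i, j] = number of class-i samples that classifier k
    labelled as class j; preds[k] = label output by classifier k for
    the current test sample. Returns the fused class index."""
    n_classes = len(priors)
    scores = np.array(priors, dtype=float)
    for cm, s in zip(conf_mats, preds):
        cm = np.asarray(cm, dtype=float)
        # P(classifier outputs s | true class i), Laplace-smoothed
        scores *= (cm[:, s] + 1.0) / (cm.sum(axis=1) + n_classes)
    return int(scores.argmax())

# toy example: 3 action classes, two channels (colour and depth)
cm_colour = np.array([[8, 1, 1], [2, 7, 1], [0, 1, 9]])
cm_depth  = np.array([[9, 0, 1], [1, 8, 1], [2, 2, 6]])

# both per-channel SVMs output class 1 for this sample
fused = nbc_fuse([cm_colour, cm_depth], [1, 1], [1/3, 1/3, 1/3])
```

In a full pipeline, `bovw_histogram` would be applied to the local descriptors of each channel, the resulting histograms fed to the per-channel SVMs, and `nbc_fuse` applied to the SVM decisions; NBC also handles a missing channel naturally by simply omitting its factor from the product.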
Acknowledgments
This work was supported in part by the MIUR under grant “Departments of Excellence 2018-2022” of the Department of Computer Science of Sapienza University.
Cite this article
Avola, D., Bernardi, M. & Foresti, G.L. Fusing depth and colour information for human action recognition. Multimed Tools Appl 78, 5919–5939 (2019). https://doi.org/10.1007/s11042-018-6875-7