
Fusing Multiple Features for Depth-Based Action Recognition

Published: 31 March 2015

Abstract

Human action recognition is a very active research topic in computer vision and pattern recognition. Recently, the three-dimensional (3D) depth data captured by emerging RGB-D sensors have shown great potential for human action recognition, and several features and algorithms have been proposed for depth-based action recognition. This raises a question: can complementary features be found and combined to improve recognition accuracy significantly? To address this question and gain a better understanding of the problem, we study the fusion of different features for depth-based action recognition. Although data fusion has shown great success in other areas, it has not yet been well studied for 3D action recognition. Several issues need to be addressed, for example, whether fusion is helpful at all for depth-based action recognition, and how to perform the fusion properly. In this article, we comprehensively study different fusion schemes using diverse features for action characterization in depth videos. Two levels of fusion are investigated, namely the feature level and the decision level, and various methods are explored at each level. Four different features are considered to characterize depth action patterns from different aspects. Experiments are conducted on four challenging depth action databases in order to evaluate the fusion methods and identify the best ones. Our experimental results show that the four features complement each other, and that appropriate fusion methods improve the recognition accuracy significantly over each individual feature. More importantly, our fusion-based action recognition outperforms state-of-the-art approaches on these challenging databases.
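The two fusion levels named in the abstract can be illustrated with a minimal sketch. This is a generic example, not the authors' exact method: the feature vectors, class names ("wave", "kick"), and posterior values are hypothetical, and the sum/product combination rules stand in for the various decision-level methods the article evaluates.

```python
# Generic sketch of the two fusion levels; all values below are hypothetical.

# Feature-level fusion: concatenate per-feature descriptors into one vector
# before training a single classifier on the fused representation.
feature_a = [0.1, 0.5]         # e.g., a depth-map descriptor
feature_b = [0.9, 0.2, 0.4]    # e.g., a skeleton-joint descriptor
fused_vector = feature_a + feature_b  # 5-dimensional fused descriptor

# Decision-level fusion: each feature gets its own classifier; their class
# posteriors are combined by a fixed rule (sum and product rules shown,
# in the spirit of classic classifier-combination frameworks).
def fuse_sum(posteriors):
    """Sum rule: average the class posteriors across classifiers."""
    classes = posteriors[0].keys()
    return {c: sum(p[c] for p in posteriors) / len(posteriors) for c in classes}

def fuse_product(posteriors):
    """Product rule: multiply the class posteriors across classifiers."""
    classes = posteriors[0].keys()
    scores = {c: 1.0 for c in classes}
    for p in posteriors:
        for c in classes:
            scores[c] *= p[c]
    return scores

def decide(scores):
    """Pick the class with the highest fused score."""
    return max(scores, key=scores.get)

# Hypothetical posteriors from three feature-specific classifiers.
posteriors = [
    {"wave": 0.6, "kick": 0.4},
    {"wave": 0.7, "kick": 0.3},
    {"wave": 0.3, "kick": 0.7},
]
print(decide(fuse_sum(posteriors)))      # -> wave
print(decide(fuse_product(posteriors)))  # -> wave
```

Note the design trade-off the rules expose: the product rule is severely penalized by any single classifier assigning a near-zero posterior, whereas the sum rule is more robust to one unreliable feature, which is one reason fusion schemes must be compared empirically.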



Published In

ACM Transactions on Intelligent Systems and Technology  Volume 6, Issue 2
Special Section on Visual Understanding with RGB-D Sensors
May 2015
381 pages
ISSN:2157-6904
EISSN:2157-6912
DOI:10.1145/2753829
Editor: Huan Liu
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 31 March 2015
Accepted: 01 March 2014
Revised: 01 December 2013
Received: 01 July 2013
Published in TIST Volume 6, Issue 2


Author Tags

  1. 4D descriptor
  2. RGB-D sensor
  3. action recognition
  4. data fusion
  5. decision level
  6. depth maps
  7. feature level
  8. feature selection
  9. skeleton
  10. spatiotemporal features

Qualifiers

  • Research-article
  • Research
  • Refereed


Cited By

  • (2025) Multimodal Document Analytics for Banking Process Automation. Information Fusion 118, 102973. DOI: 10.1016/j.inffus.2025.102973
  • (2025) Multi-modal deep learning for credit rating prediction using text and numerical data streams. Applied Soft Computing 171, 112771. DOI: 10.1016/j.asoc.2025.112771
  • (2024) Perceiving Actions via Temporal Video Frame Pairs. ACM Transactions on Intelligent Systems and Technology. DOI: 10.1145/3652611
  • (2024) A hybrid deep learning framework for daily living human activity recognition with cluster-based video summarization. Multimedia Tools and Applications. DOI: 10.1007/s11042-024-19022-0
  • (2024) Decoupled spatio-temporal grouping transformer for skeleton-based action recognition. The Visual Computer 40, 8, 5733--5745. DOI: 10.1007/s00371-023-03132-1
  • (2023) Daily Living Human Activity Recognition Using Deep Neural Networks. In 2023 International Workshop on Intelligent Systems (IWIS). 1--6. DOI: 10.1109/IWIS58789.2023.10284678
  • (2023) Two-Stream Architecture Using RGB-based ConvNet and Pose-based LSTM for Video Action Recognition. In 2023 15th International Conference on Innovations in Information Technology (IIT). 127--131. DOI: 10.1109/IIT59782.2023.10366415
  • (2023) DroneAttention: Sparse weighted temporal attention for drone-camera based activity recognition. Neural Networks 159, 57--69. DOI: 10.1016/j.neunet.2022.12.005
  • (2022) A Holistic Approach for Role Inference and Action Anticipation in Human Teams. ACM Transactions on Intelligent Systems and Technology 13, 6, 1--24. DOI: 10.1145/3531230
  • (2022) Weakly Supervised Video Object Segmentation via Dual-attention Cross-branch Fusion. ACM Transactions on Intelligent Systems and Technology 13, 3, 1--20. DOI: 10.1145/3506716
