
Fusing Multiple Features for Depth-Based Action Recognition

Published: 31 March 2015

Abstract

Human action recognition is a very active research topic in computer vision and pattern recognition. Recently, the three-dimensional (3D) depth data captured by emerging RGB-D sensors have shown great potential for human action recognition, and several features and algorithms have been proposed for depth-based action recognition. This raises a question: can complementary features be found and combined to improve recognition accuracy significantly? To address this question and gain a better understanding of the problem, we study the fusion of different features for depth-based action recognition. Although data fusion has shown great success in other areas, it has not yet been well studied for 3D action recognition. Several issues need to be addressed, for example, whether fusion is helpful at all for depth-based action recognition, and how to perform the fusion properly. In this article, we comprehensively study different fusion schemes using diverse features for action characterization in depth videos. Two levels of fusion are investigated, namely the feature level and the decision level, and various methods are explored at each level. Four different features are considered to characterize depth action patterns from different aspects. Experiments are conducted on four challenging depth action databases in order to evaluate the fusion methods and identify the best ones. Our experimental results show that the four features complement each other, and that appropriate fusion methods improve the recognition accuracy significantly over each individual feature. More importantly, our fusion-based action recognition outperforms state-of-the-art approaches on these challenging databases.
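The two fusion levels named in the abstract can be illustrated with a minimal sketch. This is a generic example, not the authors' exact method: the feature vectors, class names ("wave", "kick"), and posterior values are hypothetical, and the sum/product combination rules stand in for the various decision-level methods the article evaluates.

```python
# Generic sketch of the two fusion levels; all values below are hypothetical.

# Feature-level fusion: concatenate per-feature descriptors into one vector
# before training a single classifier on the fused representation.
feature_a = [0.1, 0.5]         # e.g., a depth-map descriptor
feature_b = [0.9, 0.2, 0.4]    # e.g., a skeleton-joint descriptor
fused_vector = feature_a + feature_b  # 5-dimensional fused descriptor

# Decision-level fusion: each feature gets its own classifier; their class
# posteriors are combined by a fixed rule (sum and product rules shown,
# in the spirit of classic classifier-combination frameworks).
def fuse_sum(posteriors):
    """Sum rule: average the class posteriors across classifiers."""
    classes = posteriors[0].keys()
    return {c: sum(p[c] for p in posteriors) / len(posteriors) for c in classes}

def fuse_product(posteriors):
    """Product rule: multiply the class posteriors across classifiers."""
    classes = posteriors[0].keys()
    scores = {c: 1.0 for c in classes}
    for p in posteriors:
        for c in classes:
            scores[c] *= p[c]
    return scores

def decide(scores):
    """Pick the class with the highest fused score."""
    return max(scores, key=scores.get)

# Hypothetical posteriors from three feature-specific classifiers.
posteriors = [
    {"wave": 0.6, "kick": 0.4},
    {"wave": 0.7, "kick": 0.3},
    {"wave": 0.3, "kick": 0.7},
]
print(decide(fuse_sum(posteriors)))      # -> wave
print(decide(fuse_product(posteriors)))  # -> wave
```

Note the design trade-off the rules expose: the product rule is severely penalized by any single classifier assigning a near-zero posterior, whereas the sum rule is more robust to one unreliable feature, which is one reason fusion schemes must be compared empirically.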



Published In

ACM Transactions on Intelligent Systems and Technology  Volume 6, Issue 2
Special Section on Visual Understanding with RGB-D Sensors
May 2015
381 pages
ISSN:2157-6904
EISSN:2157-6912
DOI:10.1145/2753829
Editor: Huan Liu
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 31 March 2015
Accepted: 01 March 2014
Revised: 01 December 2013
Received: 01 July 2013
Published in TIST Volume 6, Issue 2


Author Tags

  1. 4D descriptor
  2. RGB-D sensor
  3. action recognition
  4. data fusion
  5. decision level
  6. depth maps
  7. feature level
  8. feature selection
  9. skeleton
  10. spatiotemporal features

Qualifiers

  • Research-article
  • Research
  • Refereed


Cited By

  • (2025) Multimodal Document Analytics for Banking Process Automation. Information Fusion 118, 102973. DOI: 10.1016/j.inffus.2025.102973
  • (2025) Multi-modal deep learning for credit rating prediction using text and numerical data streams. Applied Soft Computing 171, 112771. DOI: 10.1016/j.asoc.2025.112771
  • (2024) Perceiving Actions via Temporal Video Frame Pairs. ACM Transactions on Intelligent Systems and Technology. DOI: 10.1145/3652611
  • (2024) A hybrid deep learning framework for daily living human activity recognition with cluster-based video summarization. Multimedia Tools and Applications. DOI: 10.1007/s11042-024-19022-0
  • (2024) Decoupled spatio-temporal grouping transformer for skeleton-based action recognition. The Visual Computer 40, 8, 5733--5745. DOI: 10.1007/s00371-023-03132-1
  • (2023) Daily Living Human Activity Recognition Using Deep Neural Networks. In 2023 International Workshop on Intelligent Systems (IWIS). 1--6. DOI: 10.1109/IWIS58789.2023.10284678
  • (2023) Two-Stream Architecture Using RGB-based ConvNet and Pose-based LSTM for Video Action Recognition. In 2023 15th International Conference on Innovations in Information Technology (IIT). 127--131. DOI: 10.1109/IIT59782.2023.10366415
  • (2023) DroneAttention: Sparse weighted temporal attention for drone-camera based activity recognition. Neural Networks 159, 57--69. DOI: 10.1016/j.neunet.2022.12.005
  • (2022) A Holistic Approach for Role Inference and Action Anticipation in Human Teams. ACM Transactions on Intelligent Systems and Technology 13, 6, 1--24. DOI: 10.1145/3531230
  • (2022) Weakly Supervised Video Object Segmentation via Dual-attention Cross-branch Fusion. ACM Transactions on Intelligent Systems and Technology 13, 3, 1--20. DOI: 10.1145/3506716
