Abstract
In activity recognition, usage of depth data is a rapidly growing research area. This paper presents a method for recognizing single-person activities and dyadic interactions by using deep features extracted from both 3D and 2D representations, which are constructed from depth sequences. First, a 3D volume representation is generated by considering spatiotemporal information in depth frames of an action sequence. Then, a 3D-CNN is trained to learn features from these 3D volume representations. In addition to this, a 2D representation is constructed from the weighted sum of the depth sequences. This 2D representation is used with a pre-trained CNN model. Features learned from this model and the 3D-CNN model are used in training of the final approach after a feature selection step. Among the various classifiers, an SVM-based model produced the best results. The proposed method was tested on the MSR-Action3D dataset for single-person activities, the SBU dataset for dyadic interactions, and the NTU RGB+D dataset for both types of actions. Experimental results show that proposed 3D and 2D representations and deep features extracted from them are robust and efficient. The proposed method achieves comparable results with the state of the art methods in the literature.



Similar content being viewed by others
References
Shechtman, E., Irani, M.: Space–time behavior based correlation. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), pp. 405–412. IEEE (2005)
Ke, Y., Sukthankar, R., Hebert, M.: Spatio-temporal shape and flow correlation for action recognition. In: 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE (2007)
Pei, L.S., Ye, M., Zhao, X.Z., Xiang, T., Li, T.: Learning spatio-temporal features for action recognition from the side of the video. Signal Image Video Process. 10, 199–206 (2016)
Ryoo, M., Chen, C.-C., Aggarwal, J., Roy-Chowdhury, A.: An overview of contest on semantic description of human activities (SDHA) 2010. In: Recognizing Patterns in Signals, Speech, Images and Videos, pp. 270–285. Springer (2010)
Zhang, C., Platt, J.C., Viola, P.A.: Multiple instance boosting for object detection. In: Advances in Neural Information Processing Systems, pp. 1417–1424 (2005)
Al Ghamdi, M., Zhang, L., Gotoh, Y.: Spatio-temporal SIFT and its application to human action classification. In: European Conference on Computer Vision, pp. 301–310. Springer (2012)
Waltisberg, D., Yao, A., Gall, J., Van Gool, L.: Variations of a hough-voting action recognition system. In: Recognizing Patterns in Signals, Speech, Images and Videos, pp. 306–312. Springer (2010)
Li, W., Zhang, Z., Liu, Z.: Action recognition based on a bag of 3D points. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, pp. 9–14. IEEE (2010)
Iosifidis, A., Tefas, A., Pitas, I.: View-invariant action recognition based on artificial neural networks. IEEE Trans. Neural Netw. Learn. Syst. 23, 412–24 (2012)
Iosifidis, A., Tefas, A., Pitas, I.: Multi-view action recognition based on action volumes, fuzzy distances and cluster discriminant analysis. Signal Process. 93, 1445–57 (2013)
Tsai, D.M., Chiu, W.Y., Lee, M.H.: Optical flow-motion history image (OF-MHI) for action recognition. Signal Image Video Process. 9, 1897–906 (2015)
Mahbub, U., Imtiaz, H., Ahad, M.A.R.: Action recognition based on statistical analysis from clustered flow vectors. Signal Image Video Process. 8, 243–53 (2014)
Xia, L., Chen, C.-C., Aggarwal, J.: View invariant human action recognition using histograms of 3D joints. In: 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 20–27. IEEE (2012)
Raptis, M., Kirovski, D., Hoppe, H.: Real-time classification of dance gestures from skeleton animation. In: Proceedings of the 2011 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 147–156. ACM (2011)
Shotton, J., Sharp, T., Kipman, A., Fitzgibbon, A., Finocchio, M., Blake, A., et al.: Real-time human pose recognition in parts from single depth images. Commun. ACM 56, 116–24 (2013)
Sung, J., Ponce, C., Selman, B., Saxena, A.: Human activity detection from RGBD images. In: Plan, Activity, and Intent Recognition AAAI Workshop (2011)
Ryoo, M.S., Aggarwal, J.K.: Recognition of composite human activities through context-free grammar based representation. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), pp. 1709–1718. IEEE (2006)
Ryoo, M.S., Aggarwal, J.K.: Spatio-temporal relationship match: video structure comparison for recognition of complex human activities. In: 2009 IEEE 12th International Conference on Computer Vision, pp. 1593-1600. IEEE (2009)
Park, S., Aggarwal, J.K.: A hierarchical Bayesian network for event recognition of human actions and interactions. Multimed. Syst. 10, 164–79 (2004)
Ji, Y., Ye, G., Cheng, H.: Interactive body part contrast mining for human interaction recognition. In: 2014 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), pp. 1–6. IEEE (2014)
Ji, Y., Cheng, H., Zheng, Y., Li, H.: Learning contrastive feature distribution model for interaction recognition. J. Vis. Commun. Image Represent. 33, 340–9 (2015)
Yun, K., Honorio, J., Chattopadhyay, D., Berg, T.L., Samaras, D.: Two-person interaction detection using body-pose features and multiple instance learning. In: 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 28–35. IEEE (2012)
Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. 35, 221–31 (2013)
Wu, D., Shao, L.: Leveraging hierarchical parametric networks for skeletal joints based action segmentation and recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 724–731 (2014)
Le, Q.V., Zou, W.Y., Yeung, S.Y., Ng, A.Y.: Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In: 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3361–3368. IEEE (2011)
Baccouche, M., Mamalet, F., Wolf, C., Garcia, C., Baskurt, A.: Sequential deep learning for human action recognition. In: International Workshop on Human Behavior Understanding, pp. 29–39. Springer (2011)
Wang, P., Li, W., Gao, Z., Zhang, J., Tang, C., Ogunbona, P.O.: Action recognition from depth maps using deep convolutional neural networks. IEEE Trans. Hum. Mach. Syst. 46, 498–509 (2016)
Valle, E.A., Starostenko, O.: Recognition of human walking/running actions based on neural network. In: 2013 10th International Conference on Electrical Engineering, Computing Science and Automatic Control (CCE), pp. 239–244. IEEE (2013)
Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: 2015 IEEE International Conference on Computer Vision (ICCV), pp. 4489–4497. IEEE (2015)
Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems, pp. 568–576 (2014)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
Shahroudy, A., Liu, J., Ng, T.T., Wang, G.: NTU RGB plus D: a large scale dataset for 3D human activity analysis. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (Cpvr), pp. 1010-1019 (2016)
Mishkin, D., Sergievskiy, N., Matas, J.: Systematic evaluation of CNN advances on the ImageNet. arXiv preprint arXiv:1606.02228 (2016)
Kamnitsas, K., Ledig, C., Newcombe, V.F.J., Sirnpson, J.P., Kane, A.D., Menon, D.K., et al.: Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation. Med. Image Anal. 36, 61–78 (2017)
Shin, H.C., Roth, H.R., Gao, M.C., Lu, L., Xu, Z.Y., Nogues, I., et al.: Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans. Med. Imag. 35, 1285–98 (2016)
Tang, Y.: Deep learning using linear support vector machines. arXiv preprint arXiv:1306.0239 (2013)
Keceli, A.S., Can, A.B.: A multimodal approach for recognizing human actions using depth information. Int. Conf. Pattern Recognit. 22, 421–426 (2014)
Ahad, M.A.R.: Motion history images for action recognition and understanding. Springer Science & Business Media, Berlin (2012)
Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E. et al.: DeCAF: a deep convolutional activation feature for generic visual recognition. In: ICML, pp. 647–655 (2014)
Yosinski, J., Clune, J., Bengio, Y., Lipson, H.: “How transferable are features in deep neural networks?”, Advances in neural information processing systems, 3320-8 (2014)
Kononenko, I., Šimec, E., Robnik-Šikonja, M.: Overcoming the myopia of inductive learning algorithms with RELIEFF. Appl Intell 7, 39–55 (1997)
Yang, X., Tian, Y.L.: “Eigenjoints-based action recognition using naive-bayes-nearest-neighbor”, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. IEEE, pp. 14-9 (2012)
Oreifej, O., Liu, Z.: “Hon4d: Histogram of oriented 4d normals for activity recognition from depth sequences”, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 716-23 (2013)
Yang, X., Zhang, C., Tian, Y.: “Recognizing actions using depth motion maps-based histograms of oriented gradients”, Proceedings of the 20th ACM international conference on Multimedia. ACM, pp. 1057-60 (2012)
Wang, J., Liu, Z.C., Wu, Y., Yuan, J.S.: “Mining Actionlet Ensemble for Action Recognition with Depth Cameras”. 2012 Ieee Conference on Computer Vision and Pattern Recognition (Cvpr), pp. 1290-7 (2012)
Ohn-Bar, E., Trivedi, M.: “Joint angles similarities and HOG2 for action recognition”, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 465-70 (2013)
Liu, J., Shahroudy, A., Xu, D., Wang, G.: “Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition”. Computer Vision - Eccv 2016, Pt Iii, 9907, pp. 816-33 (2016)
Du, Y., Wang, W., Wang, H.: “Hierarchical Recurrent Neural Network for Skeleton Based Action Recognition”. 2015 Ieee Conference on Computer Vision and Pattern Recognition (Cvpr), pp. 1110-8 (2015)
Liu, H., Tu, J., Liu, M.: “Two-Stream 3D Convolutional Neural Network for Skeleton-Based Action Recognition”. arXiv preprint arXiv:1705.08106 (2017)
Song, S., Lan, C., Xing, J., Zeng, W., Liu, J.: “An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data”, AAAI, pp. 4263-70 (2017)
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Keçeli, A.S., Kaya, A. & Can, A.B. Combining 2D and 3D deep models for action recognition with depth information. SIViP 12, 1197–1205 (2018). https://doi.org/10.1007/s11760-018-1271-3
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11760-018-1271-3