Abstract
Robots and vision-based interactive systems that communicate with people often need to recognize a human activity before it has been performed completely. Such early prediction lets them plan their next steps in time to sustain a realistic interactive session. Predicting activities in advance is challenging, however, because some activities are simple while others are complex and composed of several smaller atomic sub-activities. In this paper, we propose a method for early prediction of both simple and complex human activities, formulated as a structured prediction task using probabilistic graphical models (PGMs). We use skeletons captured by low-cost depth sensors as high-level descriptions of the human body; relying on 3D skeletons makes our method robust to environmental factors. The proposed model is a fully observed PGM coupled with a clustering scheme that removes the model's dependency on the number-of-middle-states hyperparameter. We test our method on three popular datasets, CAD-60, UT-Kinect, and Florence 3D, and obtain accuracies of 97.6%, 100%, and 96.11%, respectively. These datasets cover both simple and complex activities. When only half of each clip is observed, we achieve 93.33% and 96.9% accuracy on CAD-60 and UT-Kinect, respectively.
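The half-clip setting reported in the abstract can be illustrated with a minimal sketch of an early-prediction evaluation loop. This is not the authors' pipeline: the function names `observe_prefix`, `early_prediction_accuracy`, and `toy_predict`, the rounding-down of the prefix length, and the toy classifier are all assumptions made for illustration. The only idea taken from the abstract is that the predictor sees a leading fraction of each skeleton clip and is scored on the label of the full activity.

```python
from typing import Callable, Sequence

Frame = Sequence[float]  # one skeleton pose, e.g. flattened 3D joint coordinates
Clip = Sequence[Frame]   # one activity clip as an ordered sequence of frames


def observe_prefix(clip: Clip, ratio: float) -> Clip:
    """Return the leading `ratio` fraction of a clip (rounded down, at least one frame)."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    n_frames = max(1, int(len(clip) * ratio))
    return clip[:n_frames]


def early_prediction_accuracy(
    clips: Sequence[Clip],
    labels: Sequence[str],
    predict: Callable[[Clip], str],
    ratio: float,
) -> float:
    """Accuracy of `predict` when it only sees a prefix of each clip."""
    if not clips or len(clips) != len(labels):
        raise ValueError("clips and labels must be non-empty and the same length")
    correct = sum(
        predict(observe_prefix(clip, ratio)) == label
        for clip, label in zip(clips, labels)
    )
    return correct / len(clips)


def toy_predict(prefix: Clip) -> str:
    """Hypothetical stand-in classifier: thresholds the mean of the first coordinate."""
    mean_first = sum(frame[0] for frame in prefix) / len(prefix)
    return "reach" if mean_first > 0 else "rest"


if __name__ == "__main__":
    # Toy one-dimensional "skeletons"; the third clip only becomes
    # recognizable as "reach" in its second half.
    clips = [
        [[1.0], [1.0], [1.0], [1.0]],
        [[-1.0], [-1.0], [-1.0], [-1.0]],
        [[-1.0], [-1.0], [3.0], [3.0]],
    ]
    labels = ["reach", "rest", "reach"]
    for ratio in (0.5, 1.0):
        acc = early_prediction_accuracy(clips, labels, toy_predict, ratio)
        print(f"observed {ratio:.0%} of each clip: accuracy {acc:.2f}")
```

In a real experiment `toy_predict` would be replaced by the trained model, and sweeping `ratio` from small values up to 1.0 would show how quickly accuracy approaches the full-clip figure.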
Cite this article
Arzani, M.M., Fathy, M., Azirani, A.A. et al. Skeleton-based structured early activity prediction. Multimed Tools Appl 80, 23023–23049 (2021). https://doi.org/10.1007/s11042-020-08875-w
DOI: https://doi.org/10.1007/s11042-020-08875-w