Skeleton-based structured early activity prediction

Multimedia Tools and Applications

Abstract

To interact with people, robots and vision-based interactive systems often need to recognize a human activity before it has been completed. Such early prediction lets them plan appropriate next steps and sustain a realistic interactive session with humans. However, predicting activities in advance is challenging, because some activities are simple while others are complex and composed of several smaller atomic sub-activities. In this paper, we propose a method for early prediction of both simple and complex human activities, formulated as a structured prediction task using probabilistic graphical models (PGMs). We use skeletons captured by low-cost depth sensors as high-level descriptions of the human body. Because it relies on 3D skeletons, our method is robust to environmental factors. The proposed model is a fully observed PGM coupled with a clustering scheme that removes the model's dependency on the number-of-middle-states hyperparameter. We test our method on three popular datasets, CAD-60, UT-Kinect, and Florence 3D, and obtain accuracies of 97.6%, 100%, and 96.11%, respectively. These datasets cover both simple and complex activities. When only half of each clip is observed, we achieve 93.33% and 96.9% accuracy on the CAD-60 and UT-Kinect datasets, respectively.
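The half-clip setting mentioned above can be illustrated with a minimal sketch of an early-prediction evaluation loop. This is not the authors' code or their PGM: the helper names `observe_prefix` and `early_accuracy`, the frame layout (a clip as a list of frames, each frame a list of 3D joints), and the rounding rule for the number of observed frames are all assumptions made for illustration. Any classifier can be passed in as `predict`.

```python
import math
from typing import Callable, Hashable, List, Sequence, Tuple

Joint = Tuple[float, float, float]  # one 3D joint position (x, y, z)
Frame = List[Joint]                 # one skeleton: a list of joints
Clip = Sequence[Frame]              # one activity clip: a sequence of frames


def observe_prefix(clip: Clip, ratio: float) -> List[Frame]:
    """Return the first `ratio` fraction of a clip, keeping at least one frame.

    Rounding up with ceil is an assumption of this sketch, not a rule
    taken from the paper.
    """
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    if len(clip) == 0:
        raise ValueError("clip must contain at least one frame")
    n_observed = max(1, math.ceil(ratio * len(clip)))
    return list(clip[:n_observed])


def early_accuracy(
    predict: Callable[[List[Frame]], Hashable],
    clips: Sequence[Clip],
    labels: Sequence[Hashable],
    ratio: float,
) -> float:
    """Accuracy of `predict` when it only sees a prefix of every clip."""
    if len(clips) != len(labels) or len(clips) == 0:
        raise ValueError("clips and labels must be non-empty and equally long")
    correct = sum(
        predict(observe_prefix(clip, ratio)) == label
        for clip, label in zip(clips, labels)
    )
    return correct / len(clips)
```

Calling `early_accuracy(model_predict, test_clips, test_labels, ratio=0.5)` with a hypothetical `model_predict` function would correspond to the "half of each clip observed" setting, while `ratio=1.0` corresponds to ordinary recognition on complete clips.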

Author information

Corresponding author

Correspondence to Mahmood Fathy.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Arzani, M.M., Fathy, M., Azirani, A.A. et al. Skeleton-based structured early activity prediction. Multimed Tools Appl 80, 23023–23049 (2021). https://doi.org/10.1007/s11042-020-08875-w

  • DOI: https://doi.org/10.1007/s11042-020-08875-w
