Deep learning-based for human segmentation and tracking, 3D human pose estimation and action recognition on monocular video of MADS dataset

Multimedia Tools and Applications

Abstract

Human segmentation and tracking (HS-T) in video often rely on person detection results. In addition, 3D human pose estimation (3D-HPE) and human activity recognition (HAR) often use human segmentation results to reduce data storage and computation time. With recent advances in deep learning, especially Convolutional Neural Networks (CNNs), excellent results have been achieved on these related tasks. Consequently, these methods can be used to build many practical applications such as sports analysis, sports scoring, health protection, teaching, and preserving traditional martial arts. In this paper, we survey relevant studies, methods, datasets, and results for HS-T, 3D-HPE, and HAR. We also analyze person detection results in depth, because they affect the results of human segmentation and human tracking. The survey goes into detail, down to source code paths. The MADS (Martial Arts, Dancing, and Sports) dataset comprises fast and complex activities and was published for the task of human pose estimation. However, before the human pose can be determined, the person needs to be detected and segmented in the video, especially because the 3D human pose annotation data differs from the point cloud data generated from RGB-D images. Therefore, we have also prepared 2D human pose annotation data on 28k images for creating 3D human pose annotation and action labeling data. Moreover, we evaluated the MADS dataset with many recently published deep learning methods for human segmentation (Mask R-CNN, PointRend, TridentNet, TensorMask, and CenterMask) and tracking, 3D-HPE (RepNet, MediaPipe Pose, Lifting from the Deep, and V2V-PoseNet), and HAR (ST-GCN, DD-Net, and PA-ResGCN) in video. All data and published results are available.

Notes

  1. http://visal.cs.cityu.edu.hk/research/mads/

  2. http://web.archive.org/web/20110827170646/http://kspace.cdvp.dcu.ie/public/interactive-segmentation/index.html, [accessed on 18 April 2021]

  3. https://www.robots.ox.ac.uk/~vgg/research/very_deep/

  4. https://neurohive.io/en/popular-networks/vgg16/, [accessed on 20 May 2021]

  5. https://github.com/rbgirshick/fast-rcnn, [accessed on 25 May 2021]

  6. https://towardsdatascience.com/faster-r-cnn-object-detection-implemented-by-keras-for-custom-data-from-googles-open-images-125f62b9141a, [accessed on 10 July 2021]

  7. https://pjreddie.com/darknet/yolov1/

  8. https://pjreddie.com/darknet/yolov2/

  9. https://pjreddie.com/darknet/yolo/

  10. https://github.com/AlexeyAB/darknet, [accessed on June 2021]

  11. https://github.com/weiliu89/caffe/tree/ssd, [accessed on 12 June 2021]

  12. https://github.com/matterport/Mask_RCNN, [accessed on 14 June 2021]

  13. https://github.com/facebookresearch/detectron2, [accessed on 14 June 2021]

  14. https://github.com/facebookresearch/detectron2/tree/master/projects/DeepLab, [accessed on 12 June 2021]

  15. https://github.com/facebookresearch/detectron2/tree/master/projects/DensePose, [accessed on 12 June 2021]

  16. https://github.com/facebookresearch/detectron2/tree/master/projects/Panoptic-DeepLab, [accessed on 14 June 2021]

  17. https://github.com/facebookresearch/detectron2/tree/master/projects/PointRend, [accessed on 14 June 2021]

  18. https://github.com/facebookresearch/detectron2/tree/master/projects/TensorMask, [accessed on 20 June 2021]

  19. https://github.com/facebookresearch/detectron2/tree/master/projects/TridentNet, [accessed on 15 June 2021]

  20. https://github.com/youngwanLEE/CenterMask, [accessed on 16 June 2021]

  21. https://github.com/scnuhealthy/Tensorflow_PersonLab, [accessed on 16 June 2021]

  22. http://host.robots.ox.ac.uk/pascal/VOC/voc2007/, [accessed on 19 June 2021]

  23. http://host.robots.ox.ac.uk/pascal/VOC/voc2012/, [accessed on 18 June 2021]

  24. https://github.com/JaviLaplaza/Pytorch-Siamese, [accessed on 20 June 2021]

  25. https://cocodataset.org/#home

  26. https://paperswithcode.com/sota/keypoint-detection-on-coco-test-challenge

  27. http://human-pose.mpi-inf.mpg.de/#overview

  28. http://human-pose.mpi-inf.mpg.de/#results

  29. http://vision.imar.ro/human3.6m/

  30. http://gvv.mpi-inf.mpg.de/3dhp-dataset/

  31. https://paperswithcode.com/sota/3d-human-pose-estimation-on-mpi-inf-3dhp

  32. http://web.archive.org/web/20110827170646/http://kspace.cdvp.dcu.ie/public/interactive-segmentation/index.html, [accessed on 18 April 2021]

  33. https://drive.google.com/file/d/1Ssob496MJMUy3vAiXkC_ChKbp4gx7OGL/view?usp=sharing, [accessed on 18 July 2021]

  34. https://drive.google.com/drive/folders/1qKxYRZIF3RI0LaA8K9wM3M684pEetHkx?usp=sharing

  35. https://github.com/duonglong289/detectron2, [accessed on 10 June 2021]

  36. https://github.com/duonglong289/detectron2/tree/master/projects/PointRend, [accessed on 15 June 2021]

  37. https://github.com/duonglong289/detectron2/tree/master/projects/TridentNet, [accessed on 16 June 2021]

  38. https://github.com/duonglong289/detectron2/tree/master/projects/TensorMask, [accessed on 16 June 2021]

  39. https://github.com/duonglong289/centermask2, [accessed on 16 June 2021]

  40. https://github.com/bastianwandt/RepNet

  41. https://github.com/TemugeB/bodypose3d

  42. https://google.github.io/mediapipe/solutions/pose.html

  43. https://github.com/DenisTome/Lifting-from-the-Deep-release

  44. https://drive.google.com/drive/folders/1qKxYRZIF3RI0LaA8K9wM3M684pEetHkx?usp=sharing

  45. https://drive.google.com/drive/folders/1qKxYRZIF3RI0LaA8K9wM3M684pEetHkx?usp=sharing

  46. https://github.com/duonglong289/detectron2.git

  47. https://github.com/duonglong289/centermask2.git

  48. https://drive.google.com/drive/folders/16YHR8MxOn4l8fMdNCJZv56AcLKfP_K4-?usp=sharing

  49. https://drive.google.com/drive/folders/1qKxYRZIF3RI0LaA8K9wM3M684pEetHkx?usp=sharing

  50. https://drive.google.com/drive/folders/1qKxYRZIF3RI0LaA8K9wM3M684pEetHkx?usp=sharing

References

  1. Allaya N, Khabir A, Sallemi-Boudawara T, Sellami N, Daoud J, Ghorbel A, Frikha M, Gargouri A, Mokdad-Gargouri R, Ayadi W (2010) Action recognition based on a bag of 3D point. In: 2010 IEEE computer society conference on computer vision and pattern recognition - workshops, vol 36, pp 3807–3814. https://doi.org/10.1007/s13277-014-3022-6

  2. Andriluka M, Pishchulin L, Gehler P, Schiele B (2014) 2d human pose estimation new benchmark and state-of-the-art analysis. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

  3. Bazarevsky V, Zhang F (2020) BlazePose : on-device real-time body pose tracking. arXiv:2006.10204

  4. Bewley A, Ge Z, Ott L, Ramos F, Upcroft B (2016) Simple online and realtime tracking. In: 2016 IEEE international conference on image processing (ICIP), pp 3464–3468. https://doi.org/10.1109/ICIP.2016.7533003https://doi.org/10.1109/ICIP.2016.7533003

  5. Bochkovskiy A, Wang CY, Liao HYM (2020) YOLOv4: optimal speed and accuracy of object detection

  6. Burrus N (2011) Kinect calibration. http://nicolas.burrus.name/index.php/Research/KinectCalibration. Accessed 05 April 2021

  7. Chahyati D, Fanany MI, Arymurthy AM (2017) Tracking people by detection using cnn features. In: Procedia computer science, vol 124, pp 167–172. Elsevier BV, https://doi.org/10.1016/j.procs.2017.12.143https://doi.org/10.1016/j.procs.2017.12.143

  8. Chen X, Girshick R, He K, Dollár P (2019) Tensormask: a foundation for dense object segmentation

  9. Chen W, Jiang Z, Ni HG, Fall X (2020) Detection based on key points of of human-skeleton using openpose. Symmetry

  10. Chen X, Lin KY, Liu W, Qian C, Lin L (2019) Weakly-supervised discovery of geometry-aware representation for 3D human pose estimation. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, vol 2019-June, pp 10,887–10,896. https://doi.org/10.1109/CVPR.2019.01115

  11. Chen LC, Papandreou G, Schroff F, Adam H (2017) Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587

  12. Chen CH, Ramanan D (2017) 3D human pose estimation = 2D pose estimation + matching. In: Proceedings - 30th IEEE conference on computer vision and pattern recognition, CVPR 2017, vol 2017-January, pp 5759–5767. https://doi.org/10.1109/CVPR.2017.610

  13. Chen Y, Zhang Z, Yuan C, Li B, Deng Y, Hu W (2021) Channel-wise topology refinement graph convolution for skeleton-based action recognition. In: Proceedings of the IEEE international conference on computer vision, pp 13,339–13,348. https://doi.org/10.1109/ICCV48922.2021.01311

  14. Chen LC, Zhu Y, Papandreou G, Schroff F, Adam H (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In: ECCV

  15. Cheng B, Collins MD, Zhu Y, Liu T, Huang TS, Adam H, Chen LC (2019) Panoptic-deeplab. In: ICCV COCO + Mapillary joint recognition challenge workshop

  16. Cheng B, Collins MD, Zhu Y, Liu T, Huang TS, Adam H, Chen LC (2020) Panoptic-deeplab: a simple, strong, and fast baseline for bottom-up panoptic segmentation. In: CVPR

  17. Ciaparrone G, Luque sánchez F, Tabik S, Troiano L, Tagliaferri R, Herrera F (2020) Deep learning in video multi-object tracking: a survey. Neurocomputing 381:61–88. https://doi.org/10.1016/j.neucom.2019.11.023https://doi.org/10.1016/j.neucom.2019.11.023

    Article  Google Scholar 

  18. Dai J, Li Y, He K, Sun J (2016) R-FCN: object detection via region-based fully convolutional networks. Adv Neural Inf Process Syst:379–387

  19. Dang Q, Yin J, Wang B, Zheng W (2021) Deep learning based 2D human pose estimation: a survey. IEEE Trans Pattern Anal Mach Intell 24(6):663–676. https://doi.org/10.26599/TST.2018.9010100

    Google Scholar 

  20. Das S, Sharma S, Dai R, Brémond F, Thonnat M (2020) VPN: learning video-pose embedding for activities of daily living. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics), vol 12354 LNCS, pp 72–90. https://doi.org/10.1007/978-3-030-58545-7_5

  21. Ding Z, Wang P, Ogunbona PO, Li W (2017) Investigation of different skeleton features for CNN-based 3D action recognition. In: 2017 IEEE international conference on multimedia and expo workshops, ICMEW 2017, pp 617–622. https://doi.org/10.1109/ICMEW.2017.8026286

  22. Ding X, Yang K, Chen W (2019) An attention-enhanced recurrent graph convolutional network for skeleton-based action recognition. ACM Int Conf Proc Series:79–84, https://doi.org/10.1145/3372806.3372814

  23. Duan H, Wang J, Chen K, Lin D (2022) PYSKL: towards good practices for skeleton action recognition. arXiv:2205.09443

  24. Duan H, Zhao Y, Chen K, Lin D, Dai B (2021) Revisiting skeleton-based action recognition. arXiv:2104.13586, (1)

  25. Everingham M, Eslami SMA, Van Gool L, Williams CKI, Winn J, Zisserman A (2015) The pascal visual object classes challenge: a retrospective. Int J Comput Vis 111(1):98–136

    Article  Google Scholar 

  26. Everingham M, Van Gool L, Williams CKI, Winn J, Zisserman A (2007) The pascal visual object classes challenge 2007 results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html. Accessed 05 April 2021

  27. Everingham M, Van Gool L, Williams CKI, Winn J, Zisserman A (2010) The pascal visual object classes challenge 2010 results. http://www.pascal-network.org/challenges/VOC/voc2010/workshop/index.html. Accessed 05 April 2021

  28. Everingham M, Van Gool L, Williams CKI, Winn J, Zisserman A (2012) The pascal visual object classes challenge 2012 results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html. Accessed 05 April 2021

  29. Fang HS, Xu Y, Wang W, Liu X, Zhu SC (2018) Learning pose grammar to encode human body configuration for 3D pose estimation. In: Thirty-second AAAI conference on artificial intelligence

  30. Georgakis G, Li R, Karanam S, Chen T, Košecká J, Wu Z (2020) Hierarchical kinematic human mesh recovery. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics), vol 12362 LNCS, pp 768–784. https://doi.org/10.1007/978-3-030-58520-4_45

  31. (2019). Geeks forgeeks: linear regression (python implementation). https://www.geeksforgeeks.org/linear-regression-python-implementation/,. Accessed 4 April 2019

  32. (2019). Geometric: geometric transformations. https://pages.mtu.edu/~shene/COURSES/cs3621/NOTES/geometry/geo-tran.html. Accessed 4 April 2019

  33. Girshick R (2015) fast r-CNN. In: Proceedings of the IEEE international conference on computer vision, vol 2015 Inter, pp 1440–1448. https://doi.org/10.1109/ICCV.2015.169

  34. Girshick R, Donahue J, Darrell T, Berkeley UC, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the 2014 IEEE conference on computer vision and pattern recognition, vol 1, p 5000. https://doi.org/10.1109/CVPR.2014.81

  35. Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, pp 580–587. https://doi.org/10.1109/CVPR.2014.81

  36. Gruosso M, Capece N, Erra U (2020) Human segmentation in surveillance video with deep learning. Multimed Tools Appl

  37. Haq EU, Jianjun H, Li K, Haq HU (2020) Human detection and tracking with deep convolutional neural networks under the constrained of noise and occluded scenes. Multimed Tools Appl 79(41-42):30,685–30,708. https://doi.org/10.1007/s11042-020-09579-x

    Article  Google Scholar 

  38. Haque MF, Lim HY, Kang DS (2019) Object detection based on vgg with resnet network. In: 2019 International conference on electronics, information, and communication (ICEIC). Institute of electronics and information engineers (IEIE), pp 1–3

  39. Harshall L (2019) Understanding semantic segmentation with unet, https://towardsdatascience.com/understanding-semantic-segmentation-with/-unet-6be4f42d4b47. Accessed 4 January 2021

  40. He K, Gkioxari G, Dollar P, Girshick R (2017) Mask r-CNN. In: ICCV

  41. He K, Zhang X, Ren S, Sun J (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell 37(9):1904–1916. https://doi.org/10.1109/TPAMI.2015.2389824

    Article  Google Scholar 

  42. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: 2016 IEEE conference on computer vision and pattern recognition, CVPR 2016, Las Vegas, NV, USA, 27-30 June 2016. IEEE computer society, pp 770–778. https://doi.org/10.1109/CVPR.2016.90

  43. Helten T, Baak A, Bharaj G, Muller M, Seidel HP, Theobalt C (2013) Personalization and evaluation of a real-time depth-based full body tracker. In: Proceedings - 2013 international conference on 3D vision, 3DV 2013, pp 279–286. https://doi.org/10.1109/3DV.2013.44

  44. Hossain MRI, Little JJ (2018) Exploiting temporal information for 3D human pose estimation. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics), vol 11214 LNCS, pp 69–86. https://doi.org/10.1007/978-3-030-01249-6_5

  45. Hu G, Cui B, Yu S (2019) Skeleton-based action recognition with synchronous local and non-local spatio-temporal learning and frequency attention. In: Proceedings - IEEE international conference on multimedia and expo, vol 2019-July, pp 1216–1221. https://doi.org/10.1109/ICME.2019.00212

  46. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ (2017) Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition

  47. Huang J, Rathod V, Sun C, Zhu M, Korattikara A, Fathi A, Fischer I, Wojna Z, Song Y, Guadarrama S, Murphy K (2017) Speed/accuracy trade-offs for modern convolutional object detectors. In: Proceedings - 30th IEEE conference on computer vision and pattern recognition, CVPR 2017, vol 2017-January, pp 3296–3305. https://doi.org/10.1109/CVPR.2017.351

  48. Hung GL, Sahimi MSB, Samma H, Almohamad TA, Lahasan B (2020) Faster R-CNN deep learning model for pedestrian detection from drone images. In: SN computer science. Springer Singapore, vol 1, pp 1–9. https://doi.org/10.1007/s42979-020-00125-y

  49. Ionescu C, Papava D, Olaru V, Sminchisescu C (2014) Human3.6m: large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Trans Pattern Anal Mach Intell 36(7):1325–1339

    Article  Google Scholar 

  50. Iskakov K, Burkov E, Lempitsky VS, Malkov Y (2019) Learnable triangulation of human pose. CoRR arXiv:1905.05754

  51. Jen-Kai T, Chen-Chien H, Wei-Yen W, Shao-Kang H (2020) Deep learning-based real-time multiple-person action recognition system sensors. https://doi.org/10.3390/s20174758

  52. Ji X, Fang Q, Dong J, Shuai Q, Jiang W, Zhou X (2020) A survey on monocular 3D human pose estimation. Virtual Reality and Intelligent Hardware 2(6):471–500. https://doi.org/10.1016/j.vrih.2020.04.005

    Article  Google Scholar 

  53. Jocher G (2021) Head and person detection model, https://github.com/deepakcrk/yolov5-crowdhuman. Accessed 6 Dec 2021

  54. Jonathan L, Evan S, Trevor D (2015) Fully convolutional networks for semantic segmentation. In: Inproceedings of the IEEE conference on computer vision and pattern recognition, pp 3431–3440

  55. Khan G, Tariq Z, Usman Ghani Khan M (2019) Multi-Person tracking based on faster R-CNN and deep appearance features. Vis Object Tracking Deep Neural Netw:1–23, https://doi.org/10.5772/intechopen.85215https://doi.org/10.5772/intechopen.85215

  56. Kim BG, Park DJ (2004) Unsupervised video object segmentation and tracking based on new edge features. Pattern Recognit Lett (Elsevier) 25:1731–1742. https://doi.org/10.1016/j.patrec.2004.07.009

    Article  Google Scholar 

  57. Kirillov A, Wu Y, He K, Girshick R (2019) Pointrend: image segmentation as rendering

  58. Kocabas M, Karagoz S, Akbas E (2019) Self-supervised learning of 3D human pose using multi-view geometry. In: IEEE computer vision and pattern recognition, arXiv:1903.02330

  59. Kong Y, Fu Y (2022) Human action recognition and prediction: a survey. Int J Comput Vis 130(5):1366–1401. https://doi.org/10.1007/s11263-022-01594-9

    Article  Google Scholar 

  60. Krizhevsky A, Sutskever I, Hinton GE (2012) Handbook of approximation algorithms and metaheuristics. In: NIPS’12: proceedings of the 25th international conference on neural information processing systems, pp 1–1432. https://doi.org/10.1201/9781420010749

  61. Kundu JN, Seth S, Rahul MV, Rakesh M, Babu RV, Chakraborty A (2020) Kinematic-structure-preserved representation for unsupervised 3d human pose estimation. In: AAAI 2020 - 34Th AAAI conference on artificial intelligence, pp 11,312–11,319. https://doi.org/10.1609/aaai.v34i07.6792

  62. Laplaza Galindo J (2018) Tracking and approaching people using deep learning techniques. In: A thesis presented for the degree of master universitari en enginyeria industrial, september

  63. Leal-Taixe L, Milan A, Reid I, Roth S, Schindler K (2015) MOTChallenge 2015: towards a benchmark for multi-target tracking. arXiv:1504.01942 pp 1–15

  64. Lee Y, Hwang JW, Lee S, Bae Y, Park J (2019) An energy and gpu-computation efficient backbone network for real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops

  65. Lee K, Lee I, Lee S (2018) Propagating LSTM: 3D pose estimation based on joint interdependency. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics), vol 11211 LNCS, pp 123–141. https://doi.org/10.1007/978-3-030-01234-2_8

  66. Lee Y, Park J (2020) Centermask: real-time anchor-free instance segmentation. In: CVPR

  67. Li S, Chan AB (2014) 3D human pose estimation from monocular images with deep convolutional neural network. In: Asian conference on computer vision. https://doi.org/10.1007/978-3-319-16808-1_23

  68. Li Y, Chen Y, Wang N, Zhang Z (2019) Scale-aware trident networks for object detection

  69. Li C, Hee Lee G (2019) Generating multiple hypotheses for 3d human pose estimation with mixture density network. In: The IEEE conference on computer vision and pattern recognition (CVPR)

  70. Li C, Lee GH (2019) Generating multiple hypotheses for 3D human pose estimation with mixture density network. In: 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR). arXiv:1904.05547

  71. Li W, Liu H, Ding R, Liu M, Wang P, Yang W (2022) exploiting temporal contexts with strided transformer for 3D human pose estimation. IEEE Trans Multimed:1–13, https://doi.org/10.1109/TMM.2022.3141231

  72. Li Y, Xia R, Liu X, Huang Q (2019) Learning shape-motion representations from geometric algebra spatio-temporal model for skeleton-based action recognition. In: Proceedings - IEEE international conference on multimedia and expo, vol 2019-July, pp 1066–1071. https://doi.org/10.1109/ICME.2019.00187

  73. Li C, Xie C, Zhang B, Han J, Zhen X, Chen J (2021) Memory attention networks for skeleton-based action recognition. IEEE Trans Neural Netw Learn Syst:1639–1645, https://doi.org/10.1109/TNNLS.2021.3061115

  74. Li M, Yu C, Wang X (2020) Skeleton-based action recognition with a triple-stream graph convolutional network. In: ACM international conference proceeding series, pp 524–528. https://doi.org/10.1145/3443467.3443809

  75. Li S, Zhang W, Chan AB (2017) Maximum-margin structured learning with deep networks for 3D human pose estimation. Int J Comput Vis 122 (1):149–168. https://doi.org/10.1007/s11263-016-0962-x

    Article  MathSciNet  Google Scholar 

  76. Liang D, Fan G, Lin G, Chen W, Pan X, Zhu H (2019) Three-stream convolutional neural network with multi-task and ensemble learning for 3d action recognition. In: IEEE Computer society conference on computer vision and pattern recognition workshops, vol 2019-june, pp 934–940. https://doi.org/10.1109/CVPRW.2019.00123

  77. Liefeng B, Cristian S (2010) Twin gaussian processes for structured prediction. Int J Comput Vis, vol 87

  78. Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft COCO: common objects in context. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics), vol 8693 LNCS, pp 740–755

  79. (2019). Linear: linear regression, https://machinelearningcoban.com/2016/12/28/linearregression/. Accessed 4 April 2019

  80. Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, Berg AC (2016) SSD: single shot multibox detector. In: European conference on computer vision, vol 9905 LNCS, pp 21–37. https://doi.org/10.1007/978-3-319-46448-0_2

  81. Liu F, Dai Q, Wang S, Zhao L, Shi X, Qiao J (2020) Multi-relational graph convolutional networks for skeleton-based action recognition. In: Proceedings - 2020 IEEE international symposium on parallel and distributed processing with applications, pp 474–480. https://doi.org/10.1109/ISPA-BDCloud-SocialCom-SustainCom51426.2020.00085

  82. Liu J, Shahroudy A, Perez M, Wang G, Duan LY, Kot AC (2020) NTU RGB+d 120: a large-scale benchmark for 3D human activity understanding. In: IEEE transactions on pattern analysis and machine intelligence, vol 42, pp 2684–2701. https://doi.org/10.1109/TPAMI.2019.2916873

  83. Liu Z, Zhang H, Chen Z, Wang Z, Ouyang W (2020) Disentangling and unifying graph convolutions for skeleton-based action recognition. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, pp 140–149. https://doi.org/10.1109/CVPR42600.2020.00022

  84. Martinez J, Hossain R, Romero J, Little JJ (2017) A simple yet effective baseline for 3d human pose estimation. In: Proceedings of the IEEE international conference on computer vision, vol 2017-October, pp 2659–2668. https://doi.org/10.1109/ICCV.2017.288

  85. Mehta D, Rhodin H, Casas D, Fua P, Sotnychenko O, Xu W, Theobalt C (2017) Monocular 3d human pose estimation in the wild using improved cnn supervision. In: 2017 fifth international conference on 3D vision (3DV)

  86. Mehta D, Sridhar S, Sotnychenko O, Rhodin H, Shafiei M, Seidel HP, Xu W, Casas D, Theobalt C (2017) Vnect: real-time 3d human pose estimation with a single rgb camera. http://gvv.mpi-inf.mpg.de/projects/VNect/. Accessed 05 April 2021

  87. Moon G, Chang JY, Lee KM (2019) Camera distance-aware top-down approach for 3D multi-person pose estimation from a single RGB image. In: Proceedings of the IEEE international conference on computer vision, vol 2019-October, pp 10,132–10,141. https://doi.org/10.1109/ICCV.2019.01023

  88. Neverova N, Novotny D, Vedaldi A (2019) Correlated uncertainty for learning dense correspondences from noisy labels

  89. Nibali A, He Z, Morgan S, Prendergast L (2019) 3D human pose estimation with 2D marginal heatmaps. In: Proceedings - 2019 IEEE winter conference on applications of computer vision, WACV 2019, Figure 1, pp 1477–1485. https://doi.org/10.1109/WACV.2019.00162

  90. Nie Q, Liu Z, Liu Y (2020) Unsupervised 3D human pose representation with viewpoint and pose disentanglement. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics), vol 12364 LNCS, pp 102–118. https://doi.org/10.1007/978-3-030-58529-7_7

  91. Nie BX, Wei P, Zhu SC (2017) Monocular 3D human pose estimation by predicting depth on joints. In: Proceedings of the IEEE international conference on computer vision, vol 2017-October, pp 3467–3475. https://doi.org/10.1109/ICCV.2017.373

  92. Omran M, Lassner C, Pons-Moll G, Gehler P, Schiele B (2018) Neural body fitting: unifying deep learning and model based human pose and shape estimation. In: Proceedings - 2018 international conference on 3D vision, 3DV 2018, pp 484–494. https://doi.org/10.1109/3DV.2018.00062

  93. Oreifej O, Liu Z (2013) HON4d: histogram of oriented 4D normals for activity recognition from depth sequences. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, pp 716–723. https://doi.org/10.1109/CVPR.2013.98

  94. Papandreou G, Zhu T, Chen LC, Gidaris S, Tompson J, Murphy K (2018) PersonLab: person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. In: ECCV

  95. Pavlakos G, Zhou X, Derpanis KG, Daniilidis K (2017) Coarse-to-fine volumetric prediction for single-image 3D human pose. In: Proceedings - 30th IEEE conference on computer vision and pattern recognition, CVPR 2017, vol 2017-January, pp 1263–1272. https://doi.org/10.1109/CVPR.2017.139

  96. Pavllo D, Feichtenhofer C, Grangier D, Auli M (2019) 3d Human pose estimation in video with temporal convolutions and semi-supervised training. In: Conference on computer vision and pattern recognition (CVPR)

  97. Pavllo D, Grangier D, Auli M (2018) Quaternet: a quaternion-based recurrent model for human motion. In: British machine vision conference (BMVC)

  98. Qin Z, Liu Y, Ji P, Kim D, Wang L, McKay B, Anwar S, Gedeon T (2021) Fusing higher-order features in graph neural networks for skeleton-based action recognition. arXiv:2105.01563 pp 1–15

  99. Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: unified, real-time object detection. In: Computer vision and pattern recognition

  100. Redmon J, Farhadi A (2016) Yolo9000: better, faster, stronger. arXiv:1612.08242

  101. Redmon J, Farhadi A (2018) Yolov3: an incremental improvement

  102. Ren S, He K, Girshick R, Sun J (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In: Advances in neural information processing systems 28, pp 91–99

  103. Ren B, Liu M, Ding R, Liu H (2020) A survey on 3d skeleton-based action recognition using learning method. arXiv:2002.05907, pp 1–8

  104. Renuka J (2021) Accuracy, precision, recall and f1 score: interpretation of performance measures. Accessed 4 January 2016

  105. Rhodin H, Constantin V, Katircioglu I, Salzmann M, Fua P (2019) Neural scene decomposition for multi-person motion capture. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, vol 2019-June, pp 7695–7705. https://doi.org/10.1109/CVPR.2019.00789

  106. Rhodin H, Salzmann M, Fua P (2018) Unsupervised geometry-aware representation for 3D human pose estimation. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics), vol 11214 LNCS, pp 765–782. https://doi.org/10.1007/978-3-030-01249-6_46

  107. Riza Alp Guler Natalia Neverova IK (2018) Densepose: dense human pose estimation in the wild

  108. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, Berg AC, Fei-fei L (2015) ImageNet large scale visual recognition challenge. Int J Comput Vis (IJCV) 115(3):211–252. https://doi.org/10.1007/s11263-015-0816-y

    Article  MathSciNet  Google Scholar 

  109. Sanchez S, Romero H, Morales A (2020) A review: comparison of performance metrics of pretrained models for object detection using the tensorflow framework. In: IOP Conference series materials science and engineering

  110. Sandler M, Howard A, Zhu M, Zhmoginov A, Chen LC (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In: CVPR

  111. Shahroudy A, Liu J, Ng TT, Wang G (2016) NTU RGB+d: a large scale dataset for 3D human activity analysis. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, vol 2016-December, pp 1010–1019. https://doi.org/10.1109/CVPR.2016.115

  112. Shao S, Zhao Z, Li B, Xiao T, Yu G, Zhang X, Sun J (2018) CrowdHuman: a benchmark for detecting human in a crowd. arXiv:1805.00123, pp 1–9

  113. Shi L, Zhang Y, Cheng J, Lu H (2019) Skeleton-based action recognition with directed graph neural networks. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, vol 2019-June, pp 7904–7913. https://doi.org/10.1109/CVPR.2019.00810

  114. Sigal L, Balan AO, Black MJ (2010) HUMAN EVA : synchronized video and motion capture dataset human motion. Int J Comput Vis 87(1):4–27. https://doi.org/10.1007/s11263-009-0273-6

    Article  Google Scholar 

  115. Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: 3rd International conference on learning representations, ICLR 2015 - conference track proceedings, pp 1–14

  116. Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: International conference on learning representations

  117. Singh M, Basu A, Mandal MK (2008) Human activity recognition based on silhouette directionality. IEEE Trans Circuits Syst Video Technol 18 (9):1280–1292. https://doi.org/10.1109/TCSVT.2008.928888

    Article  Google Scholar 

  118. Singh M, Mandai M, Basu A (2005) Pose recognition using the radon transform. Midwest Symposium on Circuits Syst 2005:1091–1094. https://doi.org/10.1109/MWSCAS.2005.1594295

  119. Song L, Yu G, Yuan J, Liu Z (2021) Human pose estimation and its application to action recognition: a survey. J Vis Commun Image Represent 76:103055. https://doi.org/10.1016/j.jvcir.2021.103055

  120. Song YF, Zhang Z, Shan C, Wang L (2020) Stronger, faster and more explainable: a graph convolutional baseline for skeleton-based action recognition. In: MM 2020 - proceedings of the 28th ACM international conference on multimedia, pp 1625–1633. https://doi.org/10.1145/3394171.3413802

  121. Song YF, Zhang Z, Wang L (2019) Richly activated graph convolutional network for action recognition with incomplete skeletons. Proc Int Conf Image Process ICIP 2019:1–5. https://doi.org/10.1109/ICIP.2019.8802917

  122. Sun X, Xiao B, Wei F, Liang S, Wei Y (2018) Integral human pose regression. In: ECCV

  123. Tekin B, Katircioglu I, Salzmann M, Lepetit V, Fua P (2016) Structured prediction of 3D human pose with deep neural networks. In: British machine vision conference 2016, BMVC 2016, vol 2016-september, pp 130.1–130.11. https://doi.org/10.5244/C.30.130

  124. Tekin B, Marquez-Neila P, Salzmann M, Fua P (2017) Learning to fuse 2D and 3D image cues for monocular body pose estimation. In: Proceedings of the IEEE international conference on computer vision, vol 2017-October, pp 3961–3970. https://doi.org/10.1109/ICCV.2017.425

  125. Thanh NT, Húng LV, Công PT (2019) An evaluation of pose estimation in video of traditional martial arts presentation. J Res Develop Inf Commun Technol 2019(2):114–126. https://doi.org/10.32913/mic-ict-research.v2019.n2.864

  126. Tian Z, Shen C, Chen H, He T (2019) FCOS: fully convolutional one-stage object detection. In: Proceedings of the IEEE international conference on computer vision (ICCV)

  127. Tian Z, Shen C, Chen H, He T (2021) FCOS: a simple and strong anchor-free object detector

  128. Tome D, Russell C, Agapito L (2017) Lifting from the deep: convolutional 3d pose estimation from a single image. In: The IEEE conference on computer vision and pattern recognition (CVPR)

  129. Tome D, Russell C, Agapito L (2017) Lifting from the deep: convolutional 3D pose estimation from a single image. In: Proceedings - 30th IEEE conference on computer vision and pattern recognition, CVPR 2017, vol 2017-January, pp 5689–5698. https://doi.org/10.1109/CVPR.2017.603

  130. Véges M, Varga V, Lőrincz A (2018) 3d human pose estimation with siamese equivariant embedding. arXiv:1809.07217

  131. Wandt B, Rosenhahn B (2019) RepNet: weakly supervised training of an adversarial reprojection network for 3d human pose estimation. In: Computer vision and pattern recognition (CVPR)

  132. Wandt B, Rosenhahn B (2019) RepNet: weakly supervised training of an adversarial reprojection network for 3d human pose estimation. CoRR arXiv:1902.09868

  133. Wang H (2017) Detection of humans in video streams using convolutional neural networks. Degree Project Comput Sci Eng

  134. Wang L, Chen Y, Guo Z, Qian K, Lin M, Li H, Ren JS (2019) Generalizing monocular 3d human pose estimation in the wild. arXiv:1904.05512

  135. Wang J, Huang S, Wang X, Tao D (2019) Not all parts are created equal: 3D pose estimation by modeling bi-directional dependencies of body parts. In: Proceedings of the IEEE international conference on computer vision, vol 2019-October, pp 7770–7779. https://doi.org/10.1109/ICCV.2019.00786

  136. Wang K, Lin L, Jiang C, Qian C, Wei P (2019) 3D human pose machines with self-supervised learning. IEEE Trans Pattern Anal Mach Intell

  137. Wang J, Liu Z, Wu Y, Yuan J (2012) Mining actionlet ensemble for action recognition with depth cameras. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, pp 1290–1297. https://doi.org/10.1109/CVPR.2012.6247813

  138. Wang J, Tan S, Zhen X, Xu S, Zheng F, He Z, Shao L (2021) Deep 3d human pose estimation: a review. Comput Vis Image Understand, p 103225

  139. Wang Y, Wang T (2020) Cycle fusion network for multi-person pose estimation. J Phys Conf Series, vol 1550(3)

  140. Wang L, Xiong Y, Wang Z, Qiao Y, Lin D, Tang X, Van Gool L (2016) Temporal segment networks: towards good practices for deep action recognition. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics), vol 9912 LNCS, pp 20–36. https://doi.org/10.1007/978-3-319-46484-8_2

  141. Wang X, Zhong Y, Jin L, Xiao Y (2019) Scale adaptive graph convolutional network for skeleton-based action recognition. In: CVPR19, vol 55, pp 306–312. https://doi.org/10.11784/tdxbz202012073

  142. Watada J, Musa Z, Jain LC, Fulcher J (2010) Human tracking: a state-of-art survey. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics), vol 6277 LNAI, pp 454–463. https://doi.org/10.1007/978-3-642-15390-7_47

  143. Willett NS, Shin HV, Jin Z, Li W, Finkelstein A (2020) Pose2Pose: pose selection and transfer for 2d character animation. In: International conference on intelligent user interfaces, proceedings IUI, pp 88–99. https://doi.org/10.1145/3377325.3377505

  144. Wojke N, Bewley A (2018) Deep cosine metric learning for person re-identification. In: 2018 IEEE Winter conference on applications of computer vision (WACV). IEEE, pp 748–756. https://doi.org/10.1109/WACV.2018.00087

  145. Wojke N, Bewley A, Paulus D (2017) Simple online and realtime tracking with a deep association metric. In: 2017 IEEE International conference on image processing (ICIP). IEEE, pp 3645–3649. https://doi.org/10.1109/ICIP.2017.8296962

  146. Wu Y, Kirillov A, Massa F, Lo WY, Girshick R (2019) Detectron2. https://github.com/facebookresearch/detectron2. Accessed 05 April 2021

  147. Xu Y, Cheng J, Wang L, Xia H, Liu F, Tao D (2018) Ensemble one-dimensional convolution neural networks for skeleton-based action recognition. IEEE Signal Process Lett 25(7):1044–1048. https://doi.org/10.1109/LSP.2018.2841649

  148. Xu J, Wang R, Rakheja V (2019) Literature Review: human segmentation with static camera. arXiv:1910.12945v1, pp 1–11

  149. Xu J, Yu Z, Ni B, Yang J, Yang X, Zhang W (2020) Deep kinematics analysis for monocular 3D human pose estimation. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, pp 896–905. https://doi.org/10.1109/CVPR42600.2020.00098

  150. Xu Y, Zhou X, Chen S, Li F (2019) Deep learning for multiple object tracking: a survey. IET Comput Vis 13(4):411–419. https://doi.org/10.1049/iet-cvi.2018.5598

  151. Yan S, Xiong Y, Lin D (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In: 32nd AAAI conference on artificial intelligence, AAAI 2018, pp 7444–7452

  152. Yang F, Wu Y, Sakti S, Nakamura S (2019) Make skeleton-based action recognition model smaller, faster and better. In: 1st ACM international conference on multimedia in asia, MMAsia 2019, vol 15, pp 1–6. https://doi.org/10.1145/3338533.3366569

  153. Yao R, Lin G, Xia S, Zhao J, Zhou Y (2019) Video object segmentation and tracking: a survey vol 1(1)

  154. Ye M, Shen Y, Du C, Pan Z, Yang R (2016) Real-time simultaneous pose and shape estimation for articulated objects using a single depth camera. IEEE Trans Pattern Anal Mach Intell 38(8):1517–1532. https://doi.org/10.1109/TPAMI.2016.2557783

  155. Yuan Y, Chu J, Leng L, Miao J, Kim BG (2020) A scale-adaptive object-tracking algorithm with occlusion detection. EURASIP J Image Video Process (Springer)

  156. Zeng A, Sun X, Yang L, Zhao N, Liu M, Xu Q (2021) Learning skeletal graph neural networks for hard 3D pose estimation. In: Proceedings of the IEEE international conference on computer vision, pp 11416–11425. https://doi.org/10.1109/ICCV48922.2021.01124

  157. Zhang P, Lan C, Xing J, Zeng W, Xue J, Zheng N (2019) View adaptive neural networks for high performance skeleton-based human action recognition. IEEE Trans Pattern Anal Mach Intell 41(8):1963–1978. https://doi.org/10.1109/TPAMI.2019.2896631

  158. Zhang P, Lan C, Zeng W, Xing J, Xue J, Zheng N (2020) Semantics-guided neural networks for efficient skeleton-based human action recognition. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, pp 1109–1118. https://doi.org/10.1109/CVPR42600.2020.00119

  159. Zhang SH, Li R, Dong X, Rosin P, Cai Z, Han X, Yang D, Huang H, Hu SM (2019) Pose2Seg: detection free human instance segmentation. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, vol 2019-June, pp 889–898. https://doi.org/10.1109/CVPR.2019.00098

  160. Zhang Z, Liu S, Liu S, Han L, Shao Y, Zhou W (2015) Human action recognition using salient region detection in complex scenes. Lecture Notes Electr Eng 322:565–572. https://doi.org/10.1007/978-3-319-08991-1_58

  161. Zhang W, Liu Z, Zhou L, Leung H, Chan AB (2017) Martial arts, dancing and sports dataset: a challenging stereo and multi-view dataset for 3D human pose estimation. Image Vis Comput, vol 61. https://doi.org/10.1016/j.imavis.2017.02.002

  162. Zhang H, Sciutto C, Agrawala M, Fatahalian K (2021) Vid2Player: controllable video sprites that behave and appear like professional tennis players. ACM Trans Graph 40(3):1–16. https://doi.org/10.1145/3448978

  163. Zhang W, Shang L, Chan AB (2014) A robust likelihood function for 3D human pose tracking. IEEE Trans Image Process 23(12):5374–5389

  164. Zhang HB, Zhang YX, Zhong B, Lei Q, Yang L, Du JX, Chen DS (2019) A comprehensive survey of vision-based human action recognition methods. Sensors (Switzerland) 19(5):1–20. https://doi.org/10.3390/s19051005

  165. Zhang X, Zou J, He K, Sun J (2016) Accelerating very deep convolutional networks for classification and detection. IEEE Trans Pattern Anal Mach Intell 38(10):1943–1955. https://doi.org/10.1109/TPAMI.2015.2502579

  166. Zhao L, Peng X, Tian Y, Kapadia M, Metaxas DN (2019) Semantic graph convolutional networks for 3D human pose regression. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, vol 2019-June, pp 3420–3430. https://doi.org/10.1109/CVPR.2019.00354

  167. Zheng C, Wu W, Chen C, Yang T, Zhu S, Shen J, Kehtarnavaz N, Shah M (2018) Deep learning-based human pose estimation: a survey. J ACM, vol 37(4)

  168. Zheng C, Zhu S, Mendieta M, Yang T, Chen C, Ding Z (2021) 3D human pose estimation with spatial and temporal transformers. In: Proceedings of the IEEE international conference on computer vision (ICCV), vol 1. arXiv:2103.10455

  169. Zhou K, Han X, Jiang N, Jia K, Lu J (2019) HEMlets pose: learning part-centric heatmap triplets for accurate 3D human pose estimation. In: Proceedings of the IEEE international conference on computer vision, vol 2019-October, pp 2344–2353. https://doi.org/10.1109/ICCV.2019.00243

  170. Zhou X, Huang Q, Sun X, Xue X, Wei Y (2017) Towards 3D human pose estimation in the wild: a weakly-supervised approach. In: Proceedings of the IEEE international conference on computer vision, vol 2017-October, pp 398–407. https://doi.org/10.1109/ICCV.2017.51

  171. Zhu J, Zou W, Xu L, Hu Y, Zhu Z, Chang M, Huang J, Huang G, Du D (2018) Action machine: rethinking action recognition in trimmed videos. arXiv:1812.05770

Funding

This research is funded by Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 102.01-2019.315.

Author information

Corresponding author

Correspondence to Van-Hung Le.

Ethics declarations

Conflict of Interests

This article is the author’s own survey and is not related to any organization or individual. It is part of a series of studies on 3D human pose estimation and human activity recognition in 3D space.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Le, VH. Deep learning-based for human segmentation and tracking, 3D human pose estimation and action recognition on monocular video of MADS dataset. Multimed Tools Appl 82, 20771–20818 (2023). https://doi.org/10.1007/s11042-022-13921-w

  • DOI: https://doi.org/10.1007/s11042-022-13921-w
