Practical 3D human skeleton tracking based on multi-view and multi-Kinect fusion

  • Regular Paper
  • Published in: Multimedia Systems

Abstract

In this paper, we propose a multi-view system for 3D human skeleton tracking based on multi-cue fusion. Multiple Kinect v2 cameras are used to build a low-cost system. Although Kinect cameras can detect 3D skeletons with their depth sensors, skeleton extraction still faces challenges such as left–right confusion and severe self-occlusion. Moreover, human skeleton tracking systems often have difficulty recovering from lost tracking. These challenges make robust 3D skeleton tracking nontrivial. To address them in a unified framework, we first correct the skeleton's left–right ambiguity by referring to the human joints extracted by OpenPose. Unlike Kinect, OpenPose extracts target joints through learning-based image analysis and can differentiate a person's front from their back. With the help of the 2D images, we can therefore correct left–right skeleton confusion. We also find that self-occlusion severely degrades Kinect joint detection owing to incorrect joint depth estimation. To alleviate this problem, we reconstruct a reference 3D skeleton by back-projecting the corresponding 2D OpenPose joints from multiple cameras. The reconstructed joints are less sensitive to occlusion and can serve as 3D anchors for skeleton fusion. Finally, we introduce inter-joint constraints into our probabilistic skeleton tracking framework to trace all joints simultaneously. Unlike conventional methods that treat each joint individually, neighboring joints are used to position each other. In this way, when joints are missing due to occlusion, the inter-joint constraints preserve skeleton consistency and the length between neighboring joints. Lastly, we evaluate our method on five challenging actions with a real-time demo system. The results show that the system tracks skeletons stably without error propagation or vibration, and that the average localization error is smaller than that of conventional methods.


References

  1. Meyer, J., Kuderer, M., Muller, J., Burgard, W.: Online marker labeling for fully automatic skeleton tracking in optical motion capture. In: Proc. the IEEE international conference on robotics and automation (ICRA), pp. 5652–5657 (2014)

  2. Canton-Ferrer, C., Casas, J.R., Pardas, M.: Marker-based human motion capture in multiview sequences. EURASIP J. Adv. Signal Process. 2010, 105476–105487 (2010)

  3. Yunardi, R.T., Winarno, A.: Marker-based motion capture for measuring joint kinematics in leg swing simulator. In: Proc. IEEE International conference on instrumentation, control, and automation, pp. 13–17 (2017)

  4. Alexiadis, D.S., Kelly, P., Daras, P., O’Connor, N.E., Boubekeur, T., Moussa, M.B.: Evaluating a dancer's performance using Kinect-based skeleton tracking. In: Proc. ACM international conference on multimedia, pp. 659–662 (2011)

  5. Yang, L., Yang, B., Dong, H., El-Saddik, A.: 3-D markerless tracking of human gait by geometric trilateration of multiple Kinects. IEEE Syst. J. 12, 1393–1403 (2018)

  6. Yoshida, A., Kim, H., Tan, J.K., Ishikawa, S.: Person tracking on Kinect images using particle filter. In: Joint 7th international conference on soft computing and intelligent systems (SCIS) and 15th international symposium on advanced intelligent systems (2014)

  7. Sundaresan, A., Chellappa, R.: Model driven segmentation of articulating humans in Laplacian eigenspace. IEEE Trans. Pattern Anal. Mach. Intell. 30, 1771–1785 (2008)

  8. Moon, S., Park, Y., Ko, D.W., Suh, I.H.: Multiple Kinect sensor fusion for human skeleton tracking using Kalman filtering. Int. J. Adv. Robot. Syst. 13, 65 (2016)

  9. Baek, S., Kim, M.: Dance experience system using multiple Kinects. Int. J. Future Comput. Commun. 2015, 45–49 (2015)

  10. Cao, Z., Simon, T., Wei, S.-E., Sheikh, Y.: Realtime multi-person 2D pose estimation using part affinity fields. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 7291–7299 (2017)

  11. Tang, Z., Gu, R., Hwang, J.: Joint multi-view people tracking and pose estimation for 3D scene reconstruction. In: IEEE international conference on multimedia and expo (ICME), San Diego, pp. 1–6 (2018)

  12. Mousse, M.A., Motamed, C., Ezin, E.C.: A multi-view human bounding volume estimation for posture recognition in elderly monitoring system. In: International conference on pattern recognition systems (2016)

  13. Pavlakos, G., Zhou, X., Derpanis, K.G., Daniilidis, K.: Harvesting multiple views for marker-less 3D human pose annotations. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 6988–6997 (2017)

  14. Rogez, G., Schmid, C.: MoCap-guided data augmentation for 3D pose estimation in the wild. http://arxiv.org/abs/1607.02046 (2016)

  15. Samir, M., Golkar, E., Rahni, A.A.A.: Comparison between the Kinect™ V1 and Kinect™ V2 for respiratory motion tracking. In: IEEE international conference on signal and image processing applications (ICSIPA), Kuala Lumpur, pp. 150–155 (2015)

  16. Wei, T., Lee, B., Qiao, Y., Kitsikidis, A., Dimitropoulos, K., Grammalidis, N.: Experimental study of skeleton tracking abilities from Microsoft Kinect non-frontal views. In: 3DTV-conference: the true vision - capture, transmission and display of 3D video (3DTV-CON), Lisbon, pp. 1–4 (2015)

  17. Ding, P., Song, Y.: Robust object tracking using color and depth images with a depth based occlusion handling and recovery. In: International conference on fuzzy systems and knowledge discovery (FSKD), Zhangjiajie, pp. 930–935 (2015)

  18. Li, K., Wang, M., Lai, Y., Yang, J., Wu, F.: 3-D motion recovery via low rank matrix restoration on articulation graphs. In: IEEE international conference on multimedia and expo (ICME), Hong Kong, pp. 721–726 (2017)

  19. Jatesiktat, P., Anopas, D., Ang, W.T.: Personalized markerless upper-body tracking with a depth camera and wrist-worn inertial measurement units. In: International conference of the IEEE engineering in medicine and biology society (EMBC), pp. 1–6 (2018)

  20. Jatesiktat, P., Ang, W.T.: Recovery of forearm occluded trajectory in Kinect using a wrist-mounted inertial measurement unit. In: International conference of the IEEE engineering in medicine and biology society (EMBC), pp. 807–812 (2017)

  21. Liu, Z., Huang, J., Han, J., Bu, S., Lv, J.: Human motion tracking by multiple RGBD cameras. IEEE Trans. Circuits Syst. Video Technol. 27(9), 2014–2027 (2017)

  22. Yang, B., Dong, H., El Saddik, A.: Development of a self-calibrated motion capture system by nonlinear trilateration of multiple Kinects v2. IEEE Sens. J. 17(8), 2481–2491 (2017)

  23. Wu, Y., Gao, L., Hoermann, S., Lindeman, R.W.: Towards robust 3D skeleton tracking using data fusion from multiple depth sensors. In: International conference on virtual worlds and games for serious applications (VS-Games), pp. 1–4, Wurzburg (2018)

  24. Baek, S., Kim, M.: Real-time performance capture using multiple Kinects. In: International conference on information and communication technology convergence (ICTC), Busan, pp. 647–648 (2014)

  25. Yang, L., Yang, B., Dong, H., Saddik, A.E.: 3-D markerless tracking of human gait by geometric trilateration of multiple Kinects. IEEE Syst. J. 12, 1393–1403 (2018)

  26. Jiang, Y., Russell, D., Godisart, T., Kholgade Banerjee, N., Banerjee, S.: Hardware synchronization of multiple Kinects and microphones for 3D audiovisual spatiotemporal data capture. In: IEEE international conference on multimedia and expo (ICME), pp. 1–6 (2018)

  27. Otto, M., Agethen, P., Geiselhart, F., Rukzio, E.: Towards ubiquitous tracking: Presenting a scalable markerless tracking approach using multiple depth cameras. In: Proceedings of EuroVR 2015 (European Association for Virtual Reality and Augmented Reality) (2015)

  28. Kitsikidis, A., Dimitropoulos, K., Douka, S., Grammalidis, N.: Dance analysis using multiple Kinect sensors. In: International conference on computer vision theory and applications (VISAPP), pp. 789–795 (2014)

  29. Yeung, K.-Y., Kwok, T.-H., Wang, C.C.L.: Improved skeleton tracking by duplex kinects: a practical approach for real-time applications. J. Comput. Inf. Sci. Eng. 13, 4 (2013)

  30. Asteriadis, S., Chatzitofis, A., Zarpalas, D., Alexiadis, D.S., Daras, P.: Estimating human motion from multiple Kinect sensors. In: Proceedings of the 6th international conference on computer vision/computer graphics collaboration techniques and applications, pp. 1–6 (2013)

  31. Li, S., Pathirana, P.N., Caelli, T.: Multi-kinect skeleton fusion for physical rehabilitation monitoring. In: International Conference of the IEEE engineering in medicine and biology society, pp. 5060–5063 (2014)

  32. Kowalski, M., Naruniec, J., Daniluk, M.: Livescan3d: a fast and inexpensive 3d data acquisition system for multiple Kinect v2 sensors. In: International Conference on 3D vision, pp. 318–325 (2015)

  33. acm.cs.nctu.edu.tw/Demo_kinect.aspx (2020)

  34. Penate-Sanchez, A., Andrade-Cetto, J., Moreno-Noguer, F.: Exhaustive linearization for robust camera pose and focal length estimation. IEEE Trans. Pattern Anal. Mach. Intell. 35(10), 2387–2400 (2013)

  35. Stoll, C., Hasler, N., Gall, J., Seidel, H.P., Theobalt, C.: Fast articulated motion tracking using a sums of Gaussians body model. In: International conference on computer vision, Barcelona, pp. 951–958 (2011)

  36. Malleson, C., Gilbert, A., Trumble, M., Collomosse, J., Hilton, A., Volino, M.: Real-time full-body motion capture from video and IMUs. In: International conference on 3D vision (3DV), Qingdao, pp. 449–457 (2017)

  37. Mehta, D., Sridhar, S., Sotnychenko, O., Rhodin, H., Shafiei, M., Seidel, H.-P., Xu, W., Casas, D., Theobalt, C.: VNect: real-time 3D human pose estimation with a single RGB camera. ACM Trans. Graph. 36, 4 (2017)

  38. Mehta, D., Sotnychenko, O., Mueller, F., Xu, W., Elgharib, M., Fua, P., Seidel, H.-P., Rhodin, H., Pons-Moll, G., Theobalt, C.: XNect: real-time multi-person 3D motion capture with a single RGB camera. ACM Trans. Graph. 39, 4 (2020)

  39. Bogo, F., Black, M.J., Loper, M., Romero, J.: Detailed full-body reconstructions of moving people from monocular RGB-D sequences. In: Proceedings of the IEEE international conference on computer vision (ICCV), pp. 2300–2308 (2015)

  40. Joo, H., Simon, T., Sheikh, Y.: Total capture: a 3D deformation model for tracking faces, hands, and bodies. In: IEEE conference on computer vision and pattern recognition (CVPR) (2018)

  41. Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Trans. Pattern Anal. Mach. Intell. 36(7), 1325–1339 (2014)

  42. Ionescu, C., Li, F., Sminchisescu, C.: Latent structured models for human pose estimation. In: 2011 international conference on computer vision, Barcelona, pp. 2220–2227 (2011)

  43. Albert, J.A., Owolabi, V., Gebel, A., Brahms, C.M., Granacher, U., Arnrich, B.: Evaluation of the pose tracking performance of the Azure Kinect and Kinect v2 for gait analysis in comparison with a gold standard: a pilot study. Sensors 20(18), 5104 (2020)

  44. Soleimani, V., Mirmehdi, M., Damen, D., Hannuna, S., Camplani, M.: 3D data acquisition and registration using two opposing Kinects. In: International conference on 3D vision (3DV), Stanford, CA, pp. 128–137 (2016)

  45. Alexiadis, D.S., Chatzitofis, A., Zioulis, N., Zoidi, O., Louizis, G., Zarpalas, D., Daras, P.: An integrated platform for live 3D human reconstruction and motion capturing. IEEE Trans. Circuits Syst. Video Technol. 27(4), 798–813 (2017)


Acknowledgements

This work was supported by the Ministry of Science and Technology (MOST), Taiwan, under grants MOST-109-2221-E-009-112-MY3, MOST-110-2221-E-A49-066-MY3, MOST-110-2218-E-A49-018-MY3, MOST-110-2634-F-009-023-MY3, and MOST-110-2221-E-A49-065-MY3, and by the Higher Education Sprout Project of National Yang-Ming Chiao Tung University and the Ministry of Education (MOE), Taiwan. This research was also supported by Ho Chi Minh City University of Technology and Education (HCMUTE), Vietnam.

Author information

Corresponding author

Correspondence to Ching-Chun Huang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Estimate an anchor point for the universal OpenPose skeleton

Let \(J = \sum_{i = 1}^{N} d_{i}^{2}\left( R, \overrightarrow{L_{i}} \right)\). The solution \(R^{*} = \left( x_{R}, y_{R}, z_{R} \right)\) of Eq. (8) can be estimated efficiently by solving the linear equations \(\partial J/\partial x_{R} = 0\), \(\partial J/\partial y_{R} = 0\), and \(\partial J/\partial z_{R} = 0\). To simplify the notation, we drop the joint index; for instance, \(O_{i}^{3D}\) denotes the position of a 3D OpenPose joint generated from the ith camera, and the same notation applies to any joint in the ith 3D OpenPose skeleton. Denote the point \(O_{i}^{3D}\) by \(\left( x_{O_{i}}, y_{O_{i}}, z_{O_{i}} \right)\) and the ith camera center \(T_{i}\) by \(\left( x_{T_{i}}, y_{T_{i}}, z_{T_{i}} \right)\); the back-projection vector from the camera center to the joint is then \(\overrightarrow{T_{i} O_{i}^{3D}} = \vec{u} = \left( x_{O_{i}} - x_{T_{i}},\, y_{O_{i}} - y_{T_{i}},\, z_{O_{i}} - z_{T_{i}} \right)\). In matrix form, these linear equations become Eq. (18)

$$\begin{bmatrix} 2\sum_{i} A_{i} & \sum_{i} D_{i} & -1 \\ 2\sum_{i} B_{i} & \sum_{i} E_{i} & -1 \\ 2\sum_{i} C_{i} & \sum_{i} F_{i} & -1 \end{bmatrix} \begin{bmatrix} x_{R} \\ y_{R} \\ z_{R} \end{bmatrix} = \begin{bmatrix} \sum_{i} G_{i} \\ \sum_{i} H_{i} \\ \sum_{i} K_{i} \end{bmatrix}$$
(18)

where

$$A_{i} = \frac{{\left( {y_{{O_{i} }} - y_{{T_{i} }} } \right)^{2} + \left( {z_{{O_{i} }} - z_{{T_{i} }} } \right)^{2} }}{{\left( {x_{{O_{i} }} - x_{{T_{i} }} } \right)^{2} + { }\left( {y_{{O_{i} }} - y_{{T_{i} }} } \right)^{2} + \left( {z_{{O_{i} }} - z_{{T_{i} }} } \right)^{2} }}$$
(19)
$$B_{i} = \frac{{\left( {x_{{O_{i} }} - x_{{T_{i} }} } \right)^{2} + \left( {z_{{O_{i} }} - z_{{T_{i} }} } \right)^{2} }}{{\left( {x_{{O_{i} }} - x_{{T_{i} }} } \right)^{2} + { }\left( {y_{{O_{i} }} - y_{{T_{i} }} } \right)^{2} + \left( {z_{{O_{i} }} - z_{{T_{i} }} } \right)^{2} }}$$
(20)
$$C_{i} = \frac{{\left( {x_{{O_{i} }} - x_{{T_{i} }} } \right)^{2} + \left( {y_{{O_{i} }} - y_{{T_{i} }} } \right)^{2} }}{{\left( {x_{{O_{i} }} - x_{{T_{i} }} } \right)^{2} + { }\left( {y_{{O_{i} }} - y_{{T_{i} }} } \right)^{2} + \left( {z_{{O_{i} }} - z_{{T_{i} }} } \right)^{2} }}$$
(21)
$$D_{i} = \frac{{2\left( {y_{{O_{i} }} - y_{{T_{i} }} } \right)\left( {z_{{O_{i} }} - z_{{T_{i} }} } \right)}}{{\left( {x_{{O_{i} }} - x_{{T_{i} }} } \right)^{2} + { }\left( {y_{{O_{i} }} - y_{{T_{i} }} } \right)^{2} + \left( {z_{{O_{i} }} - z_{{T_{i} }} } \right)^{2} }}$$
(22)
$$E_{i} = \frac{{2\left( {y_{{O_{i} }} - y_{{T_{i} }} } \right)\left( {z_{{O_{i} }} - z_{{T_{i} }} } \right)}}{{\left( {x_{{O_{i} }} - x_{{T_{i} }} } \right)^{2} + { }\left( {y_{{O_{i} }} - y_{{T_{i} }} } \right)^{2} + \left( {z_{{O_{i} }} - z_{{T_{i} }} } \right)^{2} }}$$
(23)
$$F_{i} = \frac{{2\left( {y_{{O_{i} }} - y_{{T_{i} }} } \right)\left( {z_{{O_{i} }} - z_{{T_{i} }} } \right)}}{{\left( {x_{{O_{i} }} - x_{{T_{i} }} } \right)^{2} + { }\left( {y_{{O_{i} }} - y_{{T_{i} }} } \right)^{2} + \left( {z_{{O_{i} }} - z_{{T_{i} }} } \right)^{2} }}$$
(24)
$$G_{i} = \frac{{2\left[ {\left( {z_{{O_{i} }} - z_{{T_{i} }} } \right)\left( {z_{{O_{i} }} x_{{T_{i} }} - x_{{O_{i} }} z_{{T_{i} }} } \right) - \left( {y_{{O_{i} }} - y_{{T_{i} }} } \right)\left( {x_{{O_{i} }} y_{{T_{i} }} - y_{{O_{i} }} x_{{T_{i} }} } \right)} \right]}}{{\left( {x_{{O_{i} }} - x_{{T_{i} }} } \right)^{2} + { }\left( {y_{{O_{i} }} - y_{{T_{i} }} } \right)^{2} + \left( {z_{{O_{i} }} - z_{{T_{i} }} } \right)^{2} }}$$
(25)
$$H_{i} = \frac{{2\left[ {\left( {x_{{O_{i} }} - x_{{T_{i} }} } \right)\left( {x_{{O_{i} }} y_{{T_{i} }} - y_{{O_{i} }} x_{{T_{i} }} } \right) - \left( {z_{{O_{i} }} - z_{{T_{i} }} } \right)\left( {y_{{O_{i} }} z_{{T_{i} }} - z_{{O_{i} }} y_{{T_{i} }} } \right)} \right]}}{{\left( {x_{{O_{i} }} - x_{{T_{i} }} } \right)^{2} + { }\left( {y_{{O_{i} }} - y_{{T_{i} }} } \right)^{2} + \left( {z_{{O_{i} }} - z_{{T_{i} }} } \right)^{2} }}$$
(26)
$$K_{i} = \frac{{2\left[ {\left( {y_{{O_{i} }} - y_{{T_{i} }} } \right)\left( {y_{{O_{i} }} z_{{T_{i} }} - z_{{O_{i} }} y_{{T_{i} }} } \right) - \left( {x_{{O_{i} }} - x_{{T_{i} }} } \right)\left( {z_{{O_{i} }} x_{{T_{i} }} - x_{{O_{i} }} z_{{T_{i} }} } \right)} \right]}}{{\left( {x_{{O_{i} }} - x_{{T_{i} }} } \right)^{2} + { }\left( {y_{{O_{i} }} - y_{{T_{i} }} } \right)^{2} + \left( {z_{{O_{i} }} - z_{{T_{i} }} } \right)^{2} }}.$$
(27)

Hence, the optimal solution \(R^{*} = \left( x_{R}, y_{R}, z_{R} \right)\) can be computed from Eq. (28)

$$\begin{bmatrix} x_{R} \\ y_{R} \\ z_{R} \end{bmatrix} = \begin{bmatrix} 2\sum_{i} A_{i} & \sum_{i} D_{i} & -1 \\ 2\sum_{i} B_{i} & \sum_{i} E_{i} & -1 \\ 2\sum_{i} C_{i} & \sum_{i} F_{i} & -1 \end{bmatrix}^{-1} \begin{bmatrix} \sum_{i} G_{i} \\ \sum_{i} H_{i} \\ \sum_{i} K_{i} \end{bmatrix}$$
(28)
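As a concrete illustration, the same anchor point can also be obtained from the standard closed-form solution of this least-squares objective, \(\sum_i (I - \hat{u}_i \hat{u}_i^{T}) R^{*} = \sum_i (I - \hat{u}_i \hat{u}_i^{T}) T_i\), where \(\hat{u}_i\) is the unit direction of ray \(\overrightarrow{L_{i}}\). The sketch below follows that formulation rather than the element-wise system of Eqs. (18)–(28); the function and variable names are illustrative and not taken from the authors' implementation.

```python
# Minimal sketch: least-squares 3D point closest to several back-projection rays.
# Uses the normal-equation form sum_i (I - u_i u_i^T) R = sum_i (I - u_i u_i^T) T_i,
# which minimizes the same objective J = sum_i d_i^2(R, L_i).
import numpy as np

def triangulate_anchor(camera_centers: np.ndarray, joints_3d: np.ndarray) -> np.ndarray:
    """camera_centers: (N, 3) camera centers T_i in world coordinates.
    joints_3d: (N, 3) 3D OpenPose joints O_i^{3D} defining the ray directions.
    Returns R*, the point minimizing the sum of squared distances to all rays."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for T_i, O_i in zip(camera_centers, joints_3d):
        u = O_i - T_i
        u = u / np.linalg.norm(u)          # unit ray direction from camera center to joint
        P = np.eye(3) - np.outer(u, u)     # projector onto the plane orthogonal to the ray
        A += P
        b += P @ T_i
    return np.linalg.solve(A, b)           # R* = (sum_i P_i)^{-1} (sum_i P_i T_i)

# Hypothetical usage with three cameras observing the same joint:
if __name__ == "__main__":
    T = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
    O = np.array([[0.5, 0.5, 1.0], [1.5, 0.4, 1.1], [0.4, 1.5, 0.9]])
    print(triangulate_anchor(T, O))
```

Repeating this computation for every joint ID yields the reference points \(R^{j}\) used as 3D anchors for the universal OpenPose skeleton.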

Appendix B: Mathematical notation

\(R_{c_{i}}^{w}\): The rotation from the ith camera (\(c_{i}\)) coordinate system to the world coordinate system
\(T_{c_{i}}^{w}\): The translation from the ith camera (\(c_{i}\)) coordinate system to the world coordinate system
\(X_{w}\): A 3D point in the world coordinate system
\(X_{c_{i}}\): The corresponding 3D point of \(X_{w}\) in the ith camera (\(c_{i}\)) coordinate system
\(R_{f_{i,j}}\), \(T_{f_{i,j}}\): The refinement parameters that compensate for the offset between the ith and jth point clouds in the world coordinate system
\(E_{i}^{j,3D}\): The position of the jth 3D Kinect joint from the ith camera in the world coordinate system
\(\text{Ske}\_K_{i}\): The Kinect skeleton of the ith camera
\(\text{Ske}\_K_{i,R}\): The Kinect skeleton of the ith camera with its joint indexes left–right exchanged
\(\text{Ske}\_O_{i}\): The OpenPose skeleton of the ith camera
\(\text{Ske}\_O_{U}\): The universal OpenPose skeleton
\(O_{i}^{j,2D}\): The position of the jth 2D OpenPose joint extracted from the ith camera
\(O_{i}^{j,3D}\): The position of the jth 3D OpenPose joint generated from the ith depth camera
\(R^{j}\): The reference point of the jth 3D joint
\(E_{i}^{j}\): The 3D Kinect joints with ID "j"
\(E_{*}^{j}\): The fusion result of the jth joint
\(\overrightarrow{L_{i}}\): The ray passing through the ith camera center and the 2D joint \(O_{i}^{2D}\)
\(d_{i}\left( R, \overrightarrow{L_{i}} \right)\): The distance from a point \(R\) to the line \(\overrightarrow{L_{i}}\)
\(N\): The number of Kinect cameras
\(M\): The number of joints of interest
\(J_{t}^{j}\): The state vector of the jth joint at time t
\(\hat{J}_{t}^{j}\): The optimal state vector of the jth joint at time t
\(A\): The motion matrix
\(H\): The measurement matrix
\(Q\): The covariance of the process noise
\(\sigma_{R}\): The covariance of the observation noise
\(K_{t}^{j}\): The Kalman gain of the jth joint at time t
\(V_{t}^{j}\): The error covariance of the jth joint at time t
\(\mu_{t}^{j}, \Psi_{t}^{j}\): The estimated state vector and error covariance matrix of the jth joint at time t

Appendix C: Skeleton tracking with the inter-joint constraints

By Bayes’ theorem, Eq. (11) can be decomposed as Eq. (29)

$$\begin{aligned} \hat{J}_{t}^{j} & = \arg\max_{J_{t}^{j}} P\left( J_{t}^{j} \mid E_{*,t}^{j},\, \hat{J}_{t-1}^{j},\, \hat{J}_{t}^{k,\,k \in Nb(j)} \right) \\ & = \arg\max_{J_{t}^{j}} \underbrace{P\left( E_{*,t}^{j} \mid J_{t}^{j} \right)}_{\text{likelihood term}} \cdot \underbrace{P\left( J_{t}^{j} \mid E_{*,1:t-1}^{j} \right) \cdot \prod_{k \in Nb(j)} P\left( J_{t}^{j} \mid \hat{J}_{t}^{k} \right)}_{\text{prior term}} \end{aligned}$$
(29)

Here, \(P\left( E_{*,t}^{j} \mid J_{t}^{j} \right)\) is the likelihood term, which gives the probability of observing the measurement \(E_{*,t}^{j}\) given the state vector \(J_{t}^{j}\). For computational efficiency, we model this term as the Gaussian distribution defined in Eq. (30)

$$P\left( {E_{{{*},{\text{t}}}}^{j} {|}J_{t}^{j} } \right) = {\mathcal{N}}\left( {E_{{{*},{\text{t}}}}^{j} {|}H \cdot J_{t}^{j} ,\sigma_{R} } \right),$$
(30)

where \(H = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \end{bmatrix}\) is the observation matrix that maps the state vector \(J_{t}^{j}\) to the measurement space, and \(\sigma_{R}\) is the covariance of the Gaussian distribution, which reflects the confidence of the measurement (the covariance of the observation noise). In essence, we expect the 3D position \(\left[ X^{j} \;\; Y^{j} \;\; Z^{j} \right]_{t}^{T}\) in the state vector \(J_{t}^{j}\) to stay close to the extracted measurement \(E_{*,t}^{j}\). On the other hand, Eq. (29) contains the temporal prior term \(P(J_{t}^{j} \mid E_{*,1:t-1}^{j})\) and the spatial prior term \(\prod_{k \in Nb(j)} P(J_{t}^{j} \mid \hat{J}_{t}^{k})\). Through these two prior probabilities, the estimate of \(J_{t}^{j}\) is not only correlated with its previous measurements (\(E_{*,1:t-1}^{j}\)) but also spatially dependent on the other joints' estimates (\(\hat{J}_{t}^{k}\)) at time t. In particular, the dependence on the previous measurements (i.e., the prediction step) can be expanded via the Chapman–Kolmogorov equation

$$P\left( J_{t}^{j} \mid E_{*,1:t-1}^{j} \right) = \int P\left( J_{t}^{j} \mid J_{t-1}^{j} \right) \cdot P\left( J_{t-1}^{j} \mid E_{*,1:t-1}^{j} \right) \mathrm{d}J_{t-1}^{j}$$
(31)

Next, we assume that the previous posterior distribution \(P\left( {J_{t - 1}^{j} {|}E_{*,1:t - 1}^{j } } \right)\) at time t-1 and \(P(J_{t}^{j} |J_{t - 1}^{j} )\) are Gaussians with

$$P\left( J_{t-1}^{j} \mid E_{*,1:t-1}^{j} \right) = \mathcal{N}\left( J_{t-1}^{j} \mid \hat{J}_{t-1}^{j},\, V_{t-1}^{j} \right) \quad \text{and} \quad P\left( J_{t}^{j} \mid J_{t-1}^{j} \right) = \mathcal{N}\left( J_{t}^{j} \mid A \cdot J_{t-1}^{j},\, Q \right)$$
(32)

In Eq. (32), \(\hat{J}_{t-1}^{j}\) is the optimal state estimate of the jth joint at time \(t-1\) and \(V_{t-1}^{j}\) is its error covariance at time \(t-1\). Moreover, \(A \cdot J_{t-1}^{j}\) represents the prediction of \(J_{t}^{j}\) from \(J_{t-1}^{j}\) based on the motion model; \(A\) is the typical motion matrix used in the Kalman filter, and \(Q\) is the covariance of the process noise, as mentioned in Sect. 3.2.4. We can now rewrite Eq. (31) as

$$\begin{aligned} P\left( {J_{t}^{j} {|}E_{*,1:t - 1}^{j} } \right) &= \int {{\mathcal{N}}\left( {J_{t}^{j} {|}A \cdot J_{t - 1}^{j} ,Q} \right) \cdot {\mathcal{N}}\left( {J_{t - 1}^{j} {|}\hat{J}_{t - 1}^{j} ,V_{t - 1}^{j} } \right) \cdot {\text{d}}J_{t - 1}^{j} } \hfill \\ &= { \mathcal{N}}\left( {J_{t}^{j} |A \cdot \hat{J}_{t - 1}^{j} ,AV_{t - 1}^{j} A^{T} + Q} \right) \hfill \\ \end{aligned}$$
(33)

On the other hand, the dependency on the neighboring joints is also assumed to be Gaussian. That is

$$P\left( {J_{t}^{j} {| }\hat{J}_{t}^{k,k \in Nb\left( j \right)} } \right) = {\mathcal{N}}\left( {J_{t}^{j} {|}\hat{J}_{t}^{k} + \Delta_{t - 1}^{jk} ,U} \right)$$
(34)

where \(\Delta_{t-1}^{jk} = \hat{J}_{t-1}^{j} - \hat{J}_{t-1}^{k}\) is the state-vector offset between the jth joint and the kth joint at the previous time \(t-1\). Assuming that the distance between two neighboring joints does not change dramatically over two successive time steps, we treat \(\hat{J}_{t}^{k} + \Delta_{t-1}^{jk}\) as an estimate of the target joint \(J_{t}^{j}\) obtained from its neighboring joint \(\hat{J}_{t}^{k}\); \(U\) is the spatial covariance that encodes the uncertainty of this spatial estimate. In addition, \(g^{j}\) denotes the number of neighbors of the jth joint. Combining the spatial prior in (34) with the temporal prior in (33), the whole prior term in (29) can be reformulated as

$$\begin{aligned} & P\left( J_{t}^{j} \mid E_{*,1:t-1}^{j} \right) \cdot \prod_{k \in Nb(j)} P\left( J_{t}^{j} \mid \hat{J}_{t}^{k} \right) \\ & \quad = \mathcal{N}\left( J_{t}^{j} \mid A \cdot \hat{J}_{t-1}^{j},\, A V_{t-1}^{j} A^{T} + Q \right) \cdot \prod_{k \in Nb(j)} \mathcal{N}\left( J_{t}^{j} \mid \hat{J}_{t}^{k} + \Delta_{t-1}^{jk},\, U \right) \\ & \quad = \mathcal{N}\left( J_{t}^{j} \mid A \cdot \hat{J}_{t-1}^{j},\, A V_{t-1}^{j} A^{T} + Q \right) \cdot \mathcal{N}\left( J_{t}^{j} \,\middle|\, \frac{\sum_{k \in Nb(j)} \left( \hat{J}_{t}^{k} + \Delta_{t-1}^{jk} \right)}{g^{j}},\, U \right). \end{aligned}$$
(35)

We can further combine these two Gaussian terms in (35) into a single Gaussian prior \(\mathcal{N}\left( J_{t}^{j} \mid \mu_{t}^{j}, \Psi_{t}^{j} \right)\), where \(\mu_{t}^{j}\) and \(\Psi_{t}^{j}\) are formulated as in Eqs. (12)–(14).
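For reference, this combination follows the standard product-of-Gaussians identity: the product of two Gaussian densities over the same variable is proportional to a single Gaussian with precision-weighted parameters,

$$\mathcal{N}\left( J_{t}^{j} \mid \mu_{1}, \Sigma_{1} \right) \cdot \mathcal{N}\left( J_{t}^{j} \mid \mu_{2}, \Sigma_{2} \right) \propto \mathcal{N}\left( J_{t}^{j} \mid \mu, \Psi \right), \qquad \Psi = \left( \Sigma_{1}^{-1} + \Sigma_{2}^{-1} \right)^{-1}, \qquad \mu = \Psi\left( \Sigma_{1}^{-1}\mu_{1} + \Sigma_{2}^{-1}\mu_{2} \right),$$

where \(\mu_{1} = A \cdot \hat{J}_{t-1}^{j}\) and \(\Sigma_{1} = A V_{t-1}^{j} A^{T} + Q\) come from the temporal prior and \(\mu_{2}\), \(\Sigma_{2}\) come from the spatial prior in (35). The exact expressions used by the tracker are those of Eqs. (12)–(14); the identity above is only the general form behind them.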

Next, by referring to Eqs. (29), (30), and (35), the MAP estimation in Eq. (11) becomes

$$\begin{aligned} \hat{J}_{t}^{j} & = \arg\max_{J_{t}^{j}} P\left( J_{t}^{j} \mid E_{*,t}^{j},\, \hat{J}_{t-1}^{j},\, \hat{J}_{t}^{k,\,k \in Nb(j)} \right) \\ & = \arg\max_{J_{t}^{j}} \mathcal{N}\left( E_{*,t}^{j} \mid H \cdot J_{t}^{j},\, \sigma_{R} \right) \cdot \mathcal{N}\left( J_{t}^{j} \mid \mu_{t}^{j},\, \Psi_{t}^{j} \right) \end{aligned}$$
(36)

To solve the MAP problem, we follow the theory of Kalman filtering. As mentioned, the prediction phase is shown as Eqs. (12)–(14), whereas the update phase is derived as Eqs. (15)–(17).
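To make the prediction and update steps concrete, the following sketch runs one tracking iteration for a single joint: it fuses the temporal prior (33) and the spatial prior (34) with the precision-weighted product above and then applies a Kalman-style update with the fused measurement \(E_{*,t}^{j}\). The constant-velocity motion matrix, the diagonal covariances, and all variable names are illustrative assumptions, not the exact parameters of Eqs. (12)–(17).

```python
# Sketch of one tracking step for joint j with inter-joint constraints:
# temporal prior + spatial prior fused into a single Gaussian, followed by a
# Kalman-style update with the fused 3D measurement E_{*,t}^{j}.
# The motion model and covariances below are assumptions for illustration.
import numpy as np

dt = 1.0 / 30.0                                  # assumed frame interval
A = np.eye(6); A[:3, 3:] = dt * np.eye(3)        # constant-velocity motion matrix
H = np.hstack([np.eye(3), np.zeros((3, 3))])     # observation matrix (position only)
Q = 1e-3 * np.eye(6)                             # process-noise covariance
R_obs = 1e-2 * np.eye(3)                         # observation-noise covariance (sigma_R)
U = 5e-3 * np.eye(6)                             # spatial-prior covariance

def track_joint(J_prev, V_prev, neighbors_now, deltas_prev, E_meas):
    """J_prev, V_prev: previous state (6,) and covariance (6, 6) of joint j.
    neighbors_now: list of current neighbor estimates, each (6,).
    deltas_prev: list of previous offsets Delta_{t-1}^{jk}, each (6,).
    E_meas: fused 3D measurement E_{*,t}^{j}, shape (3,)."""
    # Temporal prior: N(A * J_prev, A V_prev A^T + Q)
    mu_T = A @ J_prev
    Sigma_T = A @ V_prev @ A.T + Q
    # Spatial prior mean: average of neighbor-based estimates J_hat_t^k + Delta_{t-1}^{jk}
    mu_S = np.mean([Jk + d for Jk, d in zip(neighbors_now, deltas_prev)], axis=0)
    # Fuse the two Gaussian priors (precision-weighted product of Gaussians)
    Psi = np.linalg.inv(np.linalg.inv(Sigma_T) + np.linalg.inv(U))
    mu = Psi @ (np.linalg.inv(Sigma_T) @ mu_T + np.linalg.inv(U) @ mu_S)
    # Kalman-style update with the measurement
    S = H @ Psi @ H.T + R_obs
    K = Psi @ H.T @ np.linalg.inv(S)             # Kalman gain
    J_new = mu + K @ (E_meas - H @ mu)
    V_new = (np.eye(6) - K @ H) @ Psi
    return J_new, V_new
```

In a full system, this function would be evaluated for all M joints at every frame, with the neighbor estimates taken from the current time step, mirroring the joint-by-joint tracking described above.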

Appendix D: Time delay analysis for synchronization

Synchronization is a critical challenge in every multi-camera system. Several methods [26, 32, 44, 45] have been proposed to ensure temporal synchronization among cameras. In this work, we applied the method of [32], which does not require any extra synchronization hardware. In this method, the server first sends a synchronization signal to all client computers simultaneously through a self-constructed intranet. Upon receiving the signal, each client computer captures one frame and processes it. When processing finishes, each client sends an acknowledgment signal and the processed data back to the server. The server then performs fusion and tracking after receiving the acknowledgment signals. Finally, once the server completes its processing, the next synchronization signal is issued. The overall processing sequence is shown in Fig. 19.

Fig. 19: The processing sequence in our system (one client shown as an example)

Since our synchronization of multiple sensors is software-based, there is an unavoidable signal transmission delay, shown as the "Internet delay" in Fig. 19. Indeed, the software-based method cannot guarantee perfect synchronization because of this delay. On average, our system has an offset of 0.01827 s among the three Kinects. Although the delay does not noticeably degrade system performance, as reported in the experimental section, localization accuracy could be further improved by reducing it; a hardware-based method such as [26] can be applied if the devices support it.
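For illustration, the trigger-and-acknowledge loop described above can be sketched as follows. The socket usage, addresses, ports, and message formats are assumptions made for this example; they are not taken from the authors' implementation or from LiveScan3D [32].

```python
# Minimal sketch of the software-based frame synchronization described above.
# Addresses, ports, and message framing are illustrative assumptions.
import socket
import time

CLIENTS = [("192.168.0.11", 9000), ("192.168.0.12", 9000), ("192.168.0.13", 9000)]  # hypothetical intranet addresses

def fuse_and_track(frames):
    pass  # stand-in for the skeleton fusion and tracking stage on the server

def run_server(num_frames: int) -> None:
    # Open one persistent connection per client computer.
    conns = [socket.create_connection(addr) for addr in CLIENTS]
    for _ in range(num_frames):
        t_sync = time.time()
        # 1) Broadcast the synchronization signal to every client "simultaneously".
        for c in conns:
            c.sendall(b"SYNC")
        # 2) Each client captures and processes one frame, then replies with an
        #    acknowledgment followed by its length-prefixed processed data.
        frames = []
        for c in conns:
            ack = c.recv(4)                              # acknowledgment signal
            size = int.from_bytes(c.recv(4), "big")      # payload size in bytes
            payload = b""
            while len(payload) < size:
                payload += c.recv(size - len(payload))
            frames.append(payload)
        # 3) Fuse and track, then issue the next synchronization signal on the
        #    following loop iteration.
        fuse_and_track(frames)
        print(f"frame processed in {time.time() - t_sync:.4f} s")
```

Because the trigger travels over the intranet, the clients never start capturing at exactly the same instant, which is the source of the roughly 0.018 s offset reported above.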

About this article

Cite this article

Nguyen, MH., Hsiao, CC., Cheng, WH. et al. Practical 3D human skeleton tracking based on multi-view and multi-Kinect fusion. Multimedia Systems 28, 529–552 (2022). https://doi.org/10.1007/s00530-021-00846-x
