DeMoCap: Low-Cost Marker-Based Motion Capture

International Journal of Computer Vision

Abstract

Optical marker-based motion capture (MoCap) remains the predominant way to acquire high-fidelity articulated body motions. We introduce DeMoCap, the first data-driven approach for end-to-end marker-based MoCap, using only a sparse setup of spatio-temporally aligned, consumer-grade infrared-depth cameras. Trading off some of the features of high-end solutions, our approach is the sole robust option for marker-based MoCap at a far lower cost. We introduce an end-to-end differentiable markers-to-pose model to solve a set of challenges such as under-constrained position estimates, noisy input data, and spatial configuration invariance. We simultaneously handle depth and marker detection noise, label and localize the markers, and estimate the 3D pose by introducing a novel spatial 3D coordinate regression technique under a multi-view rendering and supervision concept. DeMoCap is driven by a special dataset captured with 4 spatio-temporally aligned, low-cost Intel RealSense D415 sensors and a professional MoCap system of 24 MXT40S cameras, used as input and ground truth, respectively.

Notes

  1. For clarity, we note that neither a T-pose nor any other calibration pose is required to initialize our method.

  2. For the dataset recordings, we used the publicly available volumetric capturing tool (https://github.com/VCL3D/VolumetricCapture) proposed by Sterzentsenko et al. (2018).

  3. The models publicly available in the authors' official repository at the time of this work.

  4. https://github.com/karfly/learnable-triangulation-pytorch/tree/9d1a26ea893a513bdff55f30ecbfd2ca8217bf5d

  5. https://github.com/CMU-Perceptual-Computing-Lab/openpose/tree/1f1aa9c59fe59c90cca685b724f4f97f76137224

  6. https://github.com/tofis/democap

  7. http://mocap.cs.sfu.ca/index154af.html?id=0018_DanceTurns002.bvh

  8. http://mocap.cs.sfu.ca/index1fe61.html?id=0015_HopOverObstacle001.bvh.

References

  • Alexanderson, S., O’Sullivan, C., & Beskow, J. (2017). Real-time labeling of non-rigid motion capture marker sets. Computers & Graphics, 69, 59–67.

  • Bascones, J. L. J. (2019). Cloud point labelling in optical motion capture systems. Ph.D. thesis, Universidad del País Vasco-Euskal Herriko Unibertsitatea.

  • Bekhtaoui, W., Sa, R., Teixeira, B., Singh, V., Kirchberg, K., Chang, Y. J., & Kapoor, A. (2020). View invariant human body detection and pose estimation from multiple depth sensors. arXiv preprint arXiv:2005.04258.

  • Buhrmester, V., Münch, D., Bulatov, D., & Arens, M. (2019). Evaluating the impact of color information in deep neural networks. In Iberian conference on pattern recognition and image analysis (pp. 302–316). Springer.

  • Burenius, M., Sullivan, J., & Carlsson, S. (2013). 3D pictorial structures for multiple view articulated pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3618–3625).

  • Cao, Z., Simon, T., Wei, S. E., & Sheikh, Y. (2017). Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7291–7299).

  • Chatzitofis, A., Zarpalas, D., Kollias, S., & Daras, P. (2019). Deepmocap: Deep optical motion capture using multiple depth sensors and retro-reflectors. Sensors, 19(2), 282.

  • Chatzitofis, A., Saroglou, L., Boutis, P., Drakoulis, P., Zioulis, N., Subramanyam, S., Kevelham, B., Charbonnier, C., Cesar, P., Zarpalas, D., et al. (2020). Human4d: A human-centric multimodal dataset for motions and immersive media. IEEE Access, 8, 176241–176262.

  • Cheng, B., Xiao, B., Wang, J., Shi, H., Huang, T.S., Zhang, L. (2019). Bottom-up higher-resolution networks for multi-person pose estimation. arXiv preprint arXiv:1908.10357.

  • Doosti, B., Naha, S., Mirbagheri, M., & Crandall, D. J. (2020). Hope-net: A graph-based model for hand-object pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 6608–6617).

  • Elhayek, A., de Aguiar, E., Jain, A., Tompson, J., Pishchulin, L., Andriluka, M., Bregler, C., Schiele, B., & Theobalt, C. (2015). Efficient convnet-based marker-less motion capture in general scenes with a low number of cameras. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3810–3818).

  • Feng, Z. H., Kittler, J., Awais, M., Huber, P., & Wu, X. J. (2018). Wing loss for robust facial landmark localisation with convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2235–2245).

  • Fuglede, B., & Topsoe, F. (2004). Jensen-Shannon divergence and Hilbert space embedding. In International symposium on information theory, 2004. ISIT 2004. Proceedings (p. 31). IEEE.

  • Gao, H., & Ji, S. (2019). Graph u-nets. In International conference on machine learning, PMLR (pp. 2083–2092).

  • Gaschler, A. (2011). Real-time marker-based motion tracking: Application to kinematic model estimation of a humanoid robot. Thesis

  • Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th international conference on artificial intelligence and statistics (pp. 249–256).

  • Guler, R. A., & Kokkinos, I. (2019). Holopose: Holistic 3D human reconstruction in-the-wild. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 10884–10894).

  • Han, S., Liu, B., Wang, R., Ye, Y., Twigg, C. D., & Kin, K. (2018). Online optical marker-based hand tracking with deep labels. ACM Transactions on Graphics (TOG), 37(4), 166.

  • Haque, A., Peng, B., Luo, Z., Alahi, A., Yeung, S., & Fei-Fei, L. (2016). Towards viewpoint invariant 3D human pose estimation. In European conference on computer vision (pp. 160–177). Springer

  • Hartley, R., & Zisserman, A. (2003). Multiple view geometry in computer vision. Cambridge University Press.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).

  • He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask r-cnn. In Proceedings of the IEEE international conference on computer vision (pp. 2961–2969).

  • Holden, D. (2018). Robust solving of optical motion capture data by denoising. ACM Transactions on Graphics (TOG), 37(4), 1–12.

  • Ionescu, C., Papava, D., Olaru, V., & Sminchisescu, C. (2013). Human3.6m: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7), 1325–1339.

  • Iskakov, K., Burkov, E., Lempitsky, V., & Malkov, Y. (2019). Learnable triangulation of human pose. In Proceedings of the IEEE international conference on computer vision (pp. 7718–7727).

  • Joo, H., Simon, T., & Sheikh, Y. (2018). Total capture: A 3D deformation model for tracking faces, hands, and bodies. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 8320–8329).

  • Keselman, L., Iselin Woodfill, J., Grunnet-Jepsen, A., & Bhowmik, A. (2017). Intel realsense stereoscopic depth cameras. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops (pp. 1–10).

  • Kingma, D. P., Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

  • Li, S., Zhang, W., & Chan, A. B. (2015). Maximum-margin structured learning with deep networks for 3D human pose estimation. In Proceedings of the IEEE international conference on computer vision (ICCV).

  • Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft coco: Common objects in context. In European conference on computer vision (pp. 740–755). Springer.

  • Loper, M., Mahmood, N., & Black, M. J. (2014). Mosh: Motion and shape capture from sparse markers. ACM Transactions on Graphics (TOG), 33(6), 220.

  • Luvizon, D. C., Tabia, H., & Picard, D. (2019). Human pose regression by combining indirect part detection and contextual information. Computers & Graphics, 85, 15–22.

  • Mahmood, N., Ghorbani, N., Troje, N. F., Pons-Moll, G., Black, M. J. (2019). Amass: Archive of motion capture as surface shapes. arXiv preprint arXiv:1904.03278.

  • Martínez-González, A., Villamizar, M., Canévet, O., Odobez, J. M. (2018a). Investigating depth domain adaptation for efficient human pose estimation. In 2018 European conference on computer vision—workshops, ECCV 2018.

  • Martínez-González, A., Villamizar, M., Canévet, O., & Odobez, J. M. (2018b). Real-time convolutional networks for depth-based human pose estimation. In 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 41–47). https://doi.org/10.1109/IROS.2018.8593383.

  • Mehta, D., Sridhar, S., Sotnychenko, O., Rhodin, H., Shafiei, M., Seidel, H. P., Xu, W., Casas, D., & Theobalt, C. (2017). Vnect: Real-time 3D human pose estimation with a single rgb camera. ACM Transactions on Graphics (TOG), 36(4), 1–14.

  • Mehta, D., Sotnychenko, O., Mueller, F., Xu, W., Elgharib, M., Fua, P., Seidel, H. P., Rhodin, H., Pons-Moll, G., Theobalt, C. (2019). Xnect: Real-time multi-person 3D human pose estimation with a single RGB camera. arXiv preprint arXiv:1907.00837.

  • moai (2021). moai: Accelerating modern data-driven workflows. https://github.com/ai-in-motion/moai.

  • Newell, A., Yang, K., & Deng, J. (2016). Stacked hourglass networks for human pose estimation. In European conference on computer vision (pp. 483–499). Springer.

  • Nibali, A., He, Z., Morgan, S., Prendergast, L. (2018). Numerical coordinate regression with convolutional neural networks. arXiv preprint arXiv:1801.07372.

  • Park, S., Yong Chang, J., Jeong, H., Lee, J. H., & Park, J. Y. (2017). Accurate and efficient 3D human pose estimation algorithm using single depth images for pose analysis in golf. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops (pp. 49–57).

  • Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Chintala, S. (2019). Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. de-Buc, E. Fox, & R. Garnett (Eds.), Advances in neural information processing systems 32 (pp. 8024–8035). Curran Associates Inc.

  • Pavllo, D., Porssut, T., Herbelin, B., & Boulic, R. (2018). Real-time finger tracking using active motion capture: A neural network approach robust to occlusions. In Proceedings of the 11th annual international conference on motion, interaction, and games (pp. 1–10).

  • Perepichka, M., Holden, D., Mudur, S. P., & Popa, T. (2019). Robust marker trajectory repair for mocap using kinematic reference. In Motion, interaction and games (pp. 1–10). Ernst & Sohn.

  • Qi, C. R., Yi, L., Su, H., Guibas, L. J. (2017). Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in neural information processing systems (pp. 5099–5108).

  • Qiu, H., Wang, C., Wang, J., Wang, N., & Zeng, W. (2019). Cross view fusion for 3d human pose estimation. In Proceedings of the IEEE international conference on computer vision (pp. 4342–4351).

  • Rhodin, H., Salzmann, M., & Fua, P. (2018). Unsupervised geometry-aware representation for 3D human pose estimation. In Proceedings of the European conference on computer vision (ECCV) (pp. 750–767).

  • Riegler, G., Osman Ulusoy, A., & Geiger, A. (2017). Octnet: Learning deep 3D representations at high resolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3577–3586).

  • Rüegg, N., Lassner, C., Black, M. J., Schindler, K. (2020). Chained representation cycling: Learning to estimate 3D human pose and shape by cycling between representations. arXiv preprint arXiv:2001.01613.

  • Sigal, L., Isard, M., Haussecker, H., & Black, M. J. (2012). Loose-limbed people: Estimating 3D human pose and motion using non-parametric belief propagation. International Journal of Computer Vision, 98(1), 15–48.

  • Sterzentsenko, V., Karakottas, A., Papachristou, A., Zioulis, N., Doumanoglou, A., Zarpalas, D., & Daras, P. (2018). A low-cost, flexible and portable volumetric capturing system. In 2018 14th international conference on signal-image technology & internet-based systems (SITIS) (pp. 200–207). IEEE.

  • Sun, X., Xiao, B., Wei, F., Liang, S., & Wei, Y. (2018). Integral human pose regression. In Proceedings of the European conference on computer vision (ECCV) (pp. 529–545).

  • Tensmeyer, C., Martinez, T. (2019). Robust keypoint detection. In 2019 international conference on document analysis and recognition workshops (ICDARW) (Vol. 5, pp. 1–7). IEEE.

  • Tompson, J.J., Jain, A., LeCun, Y., Bregler, C. (2014). Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in neural information processing systems (pp. 1799–1807).

  • Toshev, A., & Szegedy, C. (2014). Deeppose: Human pose estimation via deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1653–1660).

  • Tu, H., Wang, C., Zeng, W. (2020). Voxelpose: Towards multi-camera 3d human pose estimation in wild environment. In Computer Vision–ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16 (pp. 197–212). Springer.

  • VICON (1984). Vicon Systems Ltd. https://www.vicon.com/

  • Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., Liu, D., Mu, Y., Tan, M., Wang, X., et al. (2020). Deep high-resolution representation learning for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.

  • Wei, S. E., Ramakrishna, V., Kanade, T., & Sheikh, Y. (2016). Convolutional pose machines. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4724–4732).

  • Yang, Y., Ramanan, D. (2011). Articulated pose estimation with flexible mixtures-of-parts. In CVPR 2011 (pp. 1385–1392). IEEE.

  • Ying, K. Y. G. J. (2011). SFU Motion Capture Database. http://mocap.cs.sfu.ca/

  • Zanfir, A., Marinoiu, E., Zanfir, M., Popa, A. I., & Sminchisescu, C. (2018). Deep network for the integrated 3D sensing of multiple people in natural images. Advances in Neural Information Processing Systems, 31, 8410–8419.

  • Zhang, F., Zhu, X., Dai, H., Ye, M., & Zhu, C. (2020a). Distribution-aware coordinate representation for human pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 7093–7102).

  • Zhang, Y., An, L., Yu, T., Li, X., Li, K., & Liu, Y. (2020b). 4D association graph for realtime multi-person motion capture using multiple video cameras. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 1324–1333).

  • Zhang, Z. (2012). Microsoft kinect sensor and its effect. IEEE Multimedia, 19(2), 4–10.

Author information

Corresponding author

Correspondence to Anargyros Chatzitofis.

Additional information

Communicated by Gregory Rogez.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A Network Architecture Details

For all architectures, we follow the same schema. All our networks consist of 2 super-stages, i.e., groups of stages as defined in multi-stage and multi-branch FCN architectures. The first super-stage predicts heatmaps \(\mathbf{H}_{M,1,\ldots,K}\), which are aggregated into the final marker heatmaps \(\bar{\mathbf{H}}_M\), while the second super-stage predicts heatmaps \(\mathbf{H}_{J,K+1,\ldots,2K}\), which are aggregated into the final joint heatmaps \(\bar{\mathbf{H}}_J\). The predictions of each stage and the corresponding image features are concatenated to form the input of each subsequent stage. High-level designs of the various architectures are illustrated in Fig. 12. We discuss the details of each network below.
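
To make the shared schema concrete, the following PyTorch sketch illustrates how two super-stages could be chained; the class and argument names (`TwoSuperStageFCN`, `make_stage`) are hypothetical, and averaging the per-stage heatmaps is an assumed aggregation, not necessarily the one used in the actual model.

```python
import torch
import torch.nn as nn

class TwoSuperStageFCN(nn.Module):
    # Hypothetical wrapper illustrating the two super-stage schema. `make_stage`
    # is any callable returning a stage module that maps C input channels to
    # `heat_ch` heatmap channels while preserving spatial size. Per-stage
    # heatmaps are aggregated by averaging here (an assumption).
    def __init__(self, make_stage, feat_ch, heat_ch, stages_per_super=3):
        super().__init__()
        self.marker_stages = nn.ModuleList(
            [make_stage(feat_ch if i == 0 else feat_ch + heat_ch, heat_ch)
             for i in range(stages_per_super)])
        self.joint_stages = nn.ModuleList(
            [make_stage(feat_ch + heat_ch, heat_ch)
             for _ in range(stages_per_super)])

    @staticmethod
    def _run(stages, features, prev):
        outputs = []
        for stage in stages:
            # each stage sees F concatenated with the previous stage's prediction
            x = features if prev is None else torch.cat([features, prev], dim=1)
            prev = stage(x)
            outputs.append(prev)
        return torch.stack(outputs).mean(dim=0), prev

    def forward(self, features):
        heat_markers, last = self._run(self.marker_stages, features, None)
        heat_joints, _ = self._run(self.joint_stages, features, last)
        return heat_markers, heat_joints
```

Plugging in any of the stage designs described below as `make_stage` reproduces the 3-plus-3 (or 4-plus-4) stage layouts of the respective networks.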

A.1 Convolutional Pose Machines (CPM)

Following the original work by Wei et al. (2016), we stack 6 stages in total, separated into two super-stages of 3 stages each. However, we reduce the number of MaxPooling layers from 3 to 2 by removing the third one. This results in a higher-resolution feature map \(\mathbf{F}\) (i.e., of 2D spatial size \(40 \times 40\)), leading to an increased heatmap resolution, in line with the rest of the networks.

Subsequently, we follow the stage architecture as originally proposed in the CPM network, i.e., the first stage consists of one \(9 \times 9\) convolutional layer followed by two \(1 \times 1\) ones, whilst every subsequent stage is composed of 5 convolutional layers (\(3 \cdot 11 \times 11\) and \(2 \cdot 1 \times 1\)). Each of these stages is fed with the concatenation of \(\mathbf{F}\) and the output of the previous stage, except for Stage 1, which is fed with \(\mathbf{F}\) alone.
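
As a minimal illustration of the two stage types, the sketch below follows the kernel pattern stated above (one \(9 \times 9\) plus two \(1 \times 1\) convolutions for the first stage; three \(11 \times 11\) plus two \(1 \times 1\) for every subsequent stage); the intermediate channel width of 128 and the ReLU activations are assumptions, not specified values.

```python
import torch.nn as nn

def cpm_first_stage(in_ch, out_ch, mid_ch=128):
    # One 9x9 convolution followed by two 1x1 convolutions; mid_ch is assumed.
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size=9, padding=4), nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, mid_ch, kernel_size=1), nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, out_ch, kernel_size=1))

def cpm_refinement_stage(in_ch, out_ch, mid_ch=128):
    # Three 11x11 convolutions followed by two 1x1 convolutions, fed with the
    # concatenation of F and the previous stage's heatmaps.
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size=11, padding=5), nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, mid_ch, kernel_size=11, padding=5), nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, mid_ch, kernel_size=11, padding=5), nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, mid_ch, kernel_size=1), nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, out_ch, kernel_size=1))
```

Padding is chosen so that both stage types preserve the \(40 \times 40\) spatial resolution of \(\mathbf{F}\).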

A.2 Stacked Hourglass (SH)

We build an 8-stage Stacked Hourglass (Newell et al. 2016) based on the configuration of the original work, selecting 4 stages per super-stage. A pre-processing module first extracts a feature map \(\mathbf{F}\) which, as in all networks, is concatenated with the intermediate outputs of each stage. We use hourglass modules of depth 2, producing heatmaps of 2D spatial size \(40 \times 40\).
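
A minimal sketch of a depth-2 hourglass module is shown below; for brevity it replaces the residual blocks of Newell et al. (2016) with plain \(3 \times 3\) convolutions, so it only illustrates the encoder-decoder recursion and skip connections, not the exact module.

```python
import torch.nn as nn

def _conv_block(ch):
    # Plain 3x3 conv block standing in for the residual blocks of the original work.
    return nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))

class Hourglass(nn.Module):
    # Recursive hourglass of depth 2: two pooling/upsampling levels with skip
    # connections, so a 40x40 input yields a 40x40 output as stated above.
    def __init__(self, ch, depth=2):
        super().__init__()
        self.skip = _conv_block(ch)
        self.down = _conv_block(ch)
        self.up = _conv_block(ch)
        self.inner = Hourglass(ch, depth - 1) if depth > 1 else _conv_block(ch)
        self.pool = nn.MaxPool2d(2)
        self.upsample = nn.Upsample(scale_factor=2, mode='nearest')

    def forward(self, x):
        skip = self.skip(x)            # same-resolution branch
        y = self.down(self.pool(x))    # downsample and process
        y = self.inner(y)              # recurse (or bottleneck at depth 1)
        y = self.upsample(self.up(y))  # process and upsample back
        return skip + y                # merge the two branches
```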

A.3 High Resolution Network (HRNET)

Following the configuration of the original work (Wang et al. 2020), we build an HRNET-based network by stacking two 4-stage HRNET architectures, one per super-stage. Due to the multi-stage and multi-branch design of HRNET, we choose to build the second super-stage as a 4-stage HRNET instead of an 8-stage, 8-branch model. As with the other networks, we feed the second super-stage with the feature maps \(\mathbf{F}\) concatenated with the heatmap outputs of the first. The configuration of each super-stage module follows the original work. The initial stage contains 4 residual units formed by a bottleneck of width 64, followed by one \(3 \times 3\) convolution reducing the feature maps to a 2D spatial size of \(40 \times 40\). We follow the same schema as the original work, where the second, third, and fourth stages of each super-stage consist of 1, 4, and 3 exchange blocks, respectively.
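
The sketch below illustrates the initial-stage configuration just described, i.e., four bottleneck residual units of width 64 followed by a \(3 \times 3\) convolution; the expansion factor of 4, the batch normalization, and the projection shortcut are standard ResNet-style assumptions not specified above.

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    # Standard bottleneck residual unit (1x1 -> 3x3 -> 1x1) with width 64;
    # the expansion factor of 4 is assumed, following common ResNet/HRNET practice.
    expansion = 4

    def __init__(self, in_ch, width=64):
        super().__init__()
        out_ch = width * self.expansion
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, width, 1, bias=False), nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1, bias=False), nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch))
        self.proj = (nn.Conv2d(in_ch, out_ch, 1, bias=False)
                     if in_ch != out_ch else nn.Identity())
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.proj(x))

def hrnet_initial_stage(in_ch, out_ch):
    # Four bottleneck units followed by a 3x3 convolution projecting to out_ch,
    # a sketch of the initial-stage layout described above.
    layers = [Bottleneck(in_ch)]
    layers += [Bottleneck(64 * Bottleneck.expansion) for _ in range(3)]
    layers += [nn.Conv2d(64 * Bottleneck.expansion, out_ch, 3, padding=1)]
    return nn.Sequential(*layers)
```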

About this article

Cite this article

Chatzitofis, A., Zarpalas, D., Daras, P. et al. DeMoCap: Low-Cost Marker-Based Motion Capture. Int J Comput Vis 129, 3338–3366 (2021). https://doi.org/10.1007/s11263-021-01526-z
