Abstract
LiDAR-based human motion capture has garnered significant interest in recent years for its practicality in large-scale and unconstrained environments. However, most methods rely on cleanly segmented human point clouds as input, and the accuracy and smoothness of their motion results degrade when faced with noisy data, rendering them unsuitable for practical applications. To address these limitations and enhance the robustness and precision of motion capture under noise interference, we introduce LiveHPS++, an innovative and effective solution based on a single LiDAR system. Benefiting from three meticulously designed modules, our method learns dynamic and kinematic features from human movements, enabling the precise capture of coherent human motions in open settings and making it highly applicable to real-world scenarios. Extensive experiments show that LiveHPS++ significantly surpasses existing state-of-the-art methods across various datasets, establishing a new benchmark in the field. Project page: https://4dvlab.github.io/project_page/LiveHPS2.html
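At a high level, the task described above can be framed as regressing sequential body pose parameters (e.g., SMPL joint rotations) from sequences of LiDAR point clouds. The sketch below illustrates only this general framing in PyTorch; it is a hypothetical baseline, not the authors' three-module LiveHPS++ architecture, and all class and variable names are illustrative assumptions.

```python
# Hypothetical sketch (NOT the LiveHPS++ architecture): per-frame point cloud
# encoding followed by a temporal model regressing SMPL-style pose parameters.
import torch
import torch.nn as nn

class PointFrameEncoder(nn.Module):
    """PointNet-style encoder: shared MLP over points, then max pooling."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),
        )

    def forward(self, pts):                 # pts: (B, T, N, 3) LiDAR points
        feats = self.mlp(pts)               # (B, T, N, feat_dim)
        return feats.max(dim=2).values      # per-frame global feature (B, T, feat_dim)

class TemporalPoseRegressor(nn.Module):
    """GRU over frame features -> 24 SMPL joint rotations in axis-angle form."""
    def __init__(self, feat_dim=256, hidden=512, n_joints=24):
        super().__init__()
        self.encoder = PointFrameEncoder(feat_dim)
        self.gru = nn.GRU(feat_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, n_joints * 3)

    def forward(self, pts):                 # pts: (B, T, N, 3)
        h, _ = self.gru(self.encoder(pts))  # temporal context: (B, T, hidden)
        return self.head(h)                 # pose parameters: (B, T, 72)

# Toy usage: 2 sequences, 32 frames, 256 LiDAR points per frame.
model = TemporalPoseRegressor()
poses = model(torch.randn(2, 32, 256, 3))
print(poses.shape)  # torch.Size([2, 32, 72])
```

The temporal model is the key design point such a baseline shares with sequence-based motion capture: aggregating information across frames is what allows coherent, smooth motion to be recovered from sparse and noisy per-frame point clouds.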
This work was supported by NSFC (No. 62206173), the MoE Key Laboratory of Intelligent Perception and Human-Machine Collaboration (ShanghaiTech University), and the Shanghai Frontiers Science Center of Human-centered Artificial Intelligence (ShangHAI).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ren, Y., Han, X., Yao, Y., Long, X., Sun, Y., Ma, Y. (2025). LiveHPS++: Robust and Coherent Motion Capture in Dynamic Free Environment. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15087. Springer, Cham. https://doi.org/10.1007/978-3-031-73397-0_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73396-3
Online ISBN: 978-3-031-73397-0