
D&D: Learning Human Dynamics from Dynamic Camera

  • Conference paper

Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13665)

Abstract

3D human pose estimation from a monocular video has recently seen significant improvements. However, most state-of-the-art methods are kinematics-based and thus prone to physically implausible motions with pronounced artifacts. Current dynamics-based methods can predict physically plausible motion but are restricted to simple scenarios with a static camera view. In this work, we present D&D (Learning Human Dynamics from Dynamic Camera), which leverages the laws of physics to reconstruct 3D human motion from in-the-wild videos with a moving camera. D&D introduces inertial force control (IFC) to explain the 3D human motion in the non-inertial local frame by accounting for the inertial forces induced by the dynamic camera. To learn ground contact with limited annotations, we develop probabilistic contact torque (PCT), which is computed by differentiable sampling from contact probabilities and used to generate motions. The contact state can be weakly supervised by encouraging the model to generate correct motions. Furthermore, we propose an attentive PD controller that adjusts target pose states using temporal information to obtain smooth and accurate pose control. Our approach is entirely neural-based and runs without offline optimization or simulation in physics engines. Experiments on large-scale 3D human motion benchmarks demonstrate the effectiveness of D&D, where we exhibit superior performance against both state-of-the-art kinematics-based and dynamics-based methods. Code is available at https://github.com/Jeff-sjtu/DnD.
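Two ingredients of the abstract lend themselves to a compact illustration: differentiable sampling of contact states (a Gumbel-Softmax-style relaxation, so a motion loss can weakly supervise the contact predictor) and PD control toward a target pose. The sketch below is a minimal NumPy rendering of these two ideas under stated assumptions; the function names, gains, and shapes are illustrative placeholders, not the paper's actual API.

```python
import numpy as np

def soft_contact(contact_logits, tau=1.0, rng=None):
    """Differentiable relaxation of contact sampling (Gumbel-Softmax style).

    contact_logits: (J, 2) per-joint logits over {no contact, contact}.
    Returns a (J,) soft contact indicator in (0, 1). Because the relaxation
    is smooth, a downstream motion-reconstruction loss can back-propagate
    into the contact predictor, i.e. contact states get weak supervision.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    # Sample standard Gumbel noise: g = -log(-log(u)), u ~ Uniform(0, 1).
    u = rng.uniform(1e-9, 1.0, size=contact_logits.shape)
    gumbel = -np.log(-np.log(u))
    # Temperature-scaled softmax over the perturbed logits.
    y = (contact_logits + gumbel) / tau
    y = np.exp(y - y.max(axis=-1, keepdims=True))
    probs = y / y.sum(axis=-1, keepdims=True)
    return probs[:, 1]

def pd_torque(q, q_dot, q_target, kp, kd):
    """Classic PD control torque toward a target pose:
    tau = kp * (q_target - q) - kd * q_dot."""
    return kp * (q_target - q) - kd * q_dot
```

In D&D the target pose fed to the controller is itself adjusted by an attention module over temporal features (the "attentive" part); here `q_target`, `kp`, and `kd` are fixed constants purely for illustration.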

Cewu Lu is a member of the Qing Yuan Research Institute and the MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, China, and the Shanghai Qi Zhi Institute.




Acknowledgments

This work was supported by the National Key R&D Program of China (No. 2021ZD0110700), the Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102), the Shanghai Qi Zhi Institute, SHEITC (2018-RGZN-02046), and Tencent GY-Lab.

Author information

Correspondence to Cewu Lu.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 2984 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Li, J., Bian, S., Xu, C., Liu, G., Yu, G., Lu, C. (2022). D&D: Learning Human Dynamics from Dynamic Camera. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13665. Springer, Cham. https://doi.org/10.1007/978-3-031-20065-6_28


  • DOI: https://doi.org/10.1007/978-3-031-20065-6_28


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20064-9

  • Online ISBN: 978-3-031-20065-6

  • eBook Packages: Computer Science, Computer Science (R0)
