PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation

  • Conference paper
  • Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15140)

Abstract

We present PhysGen, a novel image-to-video generation method that takes a single image and an input condition (e.g., a force and torque applied to an object in the image) and produces a realistic, physically plausible, and temporally consistent video. Our key insight is to integrate model-based physical simulation with a data-driven video generation process, enabling plausible image-space dynamics. At the heart of our system are three core components: (i) an image understanding module that captures the geometry, materials, and physical parameters of the image; (ii) an image-space dynamics simulation model that uses rigid-body physics and the inferred parameters to simulate realistic behaviors; and (iii) an image-based rendering and refinement module that leverages generative video diffusion to produce realistic video footage featuring the simulated motion. The resulting videos are realistic in both physics and appearance and are precisely controllable, showing superior results over existing data-driven image-to-video generation methods in quantitative comparisons and a comprehensive user study. PhysGen’s resulting videos can be used for various downstream applications, such as turning an image into a realistic animation or allowing users to interact with the image and create various dynamics.
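
To make the simulate-then-render idea concrete, here is a minimal sketch (not the authors' implementation) of how image-space rigid-body dynamics driven by a user-specified impulse could be composited back onto a still image. It uses the 2D physics engine Pymunk as a stand-in simulator and OpenCV for warping; the function name `simulate_and_render`, its parameters, and the full-canvas RGBA object layer are illustrative assumptions, and the perception and diffusion-refinement stages are omitted.

```python
# A minimal sketch of the simulate-then-composite idea, NOT the authors' code.
# Assumptions: the object has already been segmented into a full-canvas RGBA
# layer (`obj_rgba`) aligned with an inpainted background (`background`,
# HxWx3 uint8), and the physical parameters below stand in for whatever the
# image understanding module would infer.
# Requires: pip install pymunk opencv-python numpy
import cv2
import numpy as np
import pymunk


def simulate_and_render(background, obj_rgba, center, size,
                        mass=1.0, friction=0.6, elasticity=0.3,
                        impulse=(400.0, 0.0), n_frames=60, dt=1.0 / 30.0):
    h, w = background.shape[:2]

    # (ii) Image-space rigid-body simulation: model the segmented object as a
    # 2D box in pixel coordinates (y grows downward, so gravity is +y).
    space = pymunk.Space()
    space.gravity = (0.0, 900.0)
    body = pymunk.Body(mass, pymunk.moment_for_box(mass, size))
    body.position = center
    shape = pymunk.Poly.create_box(body, size)
    shape.friction, shape.elasticity = friction, elasticity
    space.add(body, shape)

    # Static ground along the bottom image edge so the object has support.
    ground = pymunk.Segment(space.static_body, (0, h - 1), (w, h - 1), 2.0)
    ground.friction = friction
    space.add(ground)

    # The user-specified input condition: an impulse applied at the centroid.
    body.apply_impulse_at_local_point(impulse, (0, 0))

    # (iii) Image-based rendering: warp the object layer to each simulated
    # pose and alpha-composite it over the inpainted background.
    frames = []
    for _ in range(n_frames):
        space.step(dt)
        # Sign flip: cv2 rotates counter-clockwise, but the image y-axis
        # points down, so a positive pymunk angle appears clockwise.
        rot = cv2.getRotationMatrix2D(center, -np.degrees(body.angle), 1.0)
        rot[0, 2] += body.position.x - center[0]
        rot[1, 2] += body.position.y - center[1]
        warped = cv2.warpAffine(obj_rgba, rot, (w, h))
        alpha = warped[..., 3:4].astype(np.float32) / 255.0
        frame = alpha * warped[..., :3] + (1.0 - alpha) * background
        frames.append(frame.astype(np.uint8))
    return frames
```

In the full method, the composited frames would additionally be refined with a generative video diffusion model to restore shading, shadows, and fine appearance detail that simple layer compositing cannot produce.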

S. Gupta and S. Wang—Equal advising.

Acknowledgements

This project is supported by NSF Awards #2331878, #2340254, and #2312102, the IBM IIDAI Grant, and an Intel Research Gift. We greatly appreciate the NCSA for providing computing resources. We thank Tianhang Cheng for helpful discussions, and Emily Chen and Gloria Wang for proofreading.

Author information

Corresponding author

Correspondence to Shenlong Wang.

Electronic supplementary material

Supplementary material 1 (pdf 5996 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Liu, S., Ren, Z., Gupta, S., Wang, S. (2025). PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15140. Springer, Cham. https://doi.org/10.1007/978-3-031-73007-8_21

  • DOI: https://doi.org/10.1007/978-3-031-73007-8_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73006-1

  • Online ISBN: 978-3-031-73007-8

  • eBook Packages: Computer Science, Computer Science (R0)
