PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation

  • Conference paper
  • Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15140)

Abstract

We present PhysGen, a novel image-to-video generation method that takes a single image and an input condition (e.g., a force and torque applied to an object in the image) and produces a realistic, physically plausible, and temporally consistent video. Our key insight is to integrate model-based physical simulation with a data-driven video generation process, enabling plausible image-space dynamics. At the heart of our system are three core components: (i) an image understanding module that captures the geometry, materials, and physical parameters of the image; (ii) an image-space dynamics simulation model that uses rigid-body physics and the inferred parameters to simulate realistic behaviors; and (iii) an image-based rendering and refinement module that leverages generative video diffusion to produce realistic video footage featuring the simulated motion. The resulting videos are realistic in both physics and appearance and are precisely controllable, showing superior results over existing data-driven image-to-video generation methods in quantitative comparisons and a comprehensive user study. PhysGen’s resulting videos can be used for various downstream applications, such as turning an image into a realistic animation or allowing users to interact with the image and create various dynamics.
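
To make the simulate-then-render idea concrete, here is a minimal sketch (not the authors' implementation) of how image-space rigid-body dynamics driven by a user-specified impulse could be composited back onto a still image. It uses the 2D physics engine Pymunk as a stand-in simulator and OpenCV for warping; the function name `simulate_and_render`, its parameters, and the full-canvas RGBA object layer are illustrative assumptions, and the perception and diffusion-refinement stages are omitted.

```python
# A minimal sketch of the simulate-then-composite idea, NOT the authors' code.
# Assumptions: the object has already been segmented into a full-canvas RGBA
# layer (`obj_rgba`) aligned with an inpainted background (`background`,
# HxWx3 uint8), and the physical parameters below stand in for whatever the
# image understanding module would infer.
# Requires: pip install pymunk opencv-python numpy
import cv2
import numpy as np
import pymunk


def simulate_and_render(background, obj_rgba, center, size,
                        mass=1.0, friction=0.6, elasticity=0.3,
                        impulse=(400.0, 0.0), n_frames=60, dt=1.0 / 30.0):
    h, w = background.shape[:2]

    # (ii) Image-space rigid-body simulation: model the segmented object as a
    # 2D box in pixel coordinates (y grows downward, so gravity is +y).
    space = pymunk.Space()
    space.gravity = (0.0, 900.0)
    body = pymunk.Body(mass, pymunk.moment_for_box(mass, size))
    body.position = center
    shape = pymunk.Poly.create_box(body, size)
    shape.friction, shape.elasticity = friction, elasticity
    space.add(body, shape)

    # Static ground along the bottom image edge so the object has support.
    ground = pymunk.Segment(space.static_body, (0, h - 1), (w, h - 1), 2.0)
    ground.friction = friction
    space.add(ground)

    # The user-specified input condition: an impulse applied at the centroid.
    body.apply_impulse_at_local_point(impulse, (0, 0))

    # (iii) Image-based rendering: warp the object layer to each simulated
    # pose and alpha-composite it over the inpainted background.
    frames = []
    for _ in range(n_frames):
        space.step(dt)
        # Sign flip: cv2 rotates counter-clockwise, but the image y-axis
        # points down, so a positive pymunk angle appears clockwise.
        rot = cv2.getRotationMatrix2D(center, -np.degrees(body.angle), 1.0)
        rot[0, 2] += body.position.x - center[0]
        rot[1, 2] += body.position.y - center[1]
        warped = cv2.warpAffine(obj_rgba, rot, (w, h))
        alpha = warped[..., 3:4].astype(np.float32) / 255.0
        frame = alpha * warped[..., :3] + (1.0 - alpha) * background
        frames.append(frame.astype(np.uint8))
    return frames
```

In the full method, the composited frames would additionally be refined with a generative video diffusion model to restore shading, shadows, and fine appearance detail that simple layer compositing cannot produce.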

S. Gupta and S. Wang—Equal advising.

Acknowledgements

This project is supported by NSF Awards #2331878, #2340254, and #2312102, the IBM IIDAI Grant, and an Intel Research Gift. We greatly appreciate the NCSA for providing computing resources. We thank Tianhang Cheng for helpful discussions, and Emily Chen and Gloria Wang for proofreading.

Author information

Corresponding author

Correspondence to Shenlong Wang.

Electronic supplementary material

Supplementary material 1 (pdf 5996 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Liu, S., Ren, Z., Gupta, S., Wang, S. (2025). PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15140. Springer, Cham. https://doi.org/10.1007/978-3-031-73007-8_21

  • DOI: https://doi.org/10.1007/978-3-031-73007-8_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73006-1

  • Online ISBN: 978-3-031-73007-8

  • eBook Packages: Computer Science, Computer Science (R0)
