Abstract
We present PhysGen, a novel image-to-video generation method that converts a single image and an input condition (e.g., a force and torque applied to an object in the image) into a realistic, physically plausible, and temporally consistent video. Our key insight is to integrate model-based physical simulation with a data-driven video generation process, enabling plausible image-space dynamics. At the heart of our system are three core components: (i) an image understanding module that captures the geometry, materials, and physical parameters of the image; (ii) an image-space dynamics simulation model that uses rigid-body physics and the inferred parameters to simulate realistic behaviors; and (iii) an image-based rendering and refinement module that leverages generative video diffusion to produce realistic video footage featuring the simulated motion. The resulting videos are realistic in both physics and appearance and are precisely controllable, outperforming existing data-driven image-to-video generation methods in quantitative comparisons and a comprehensive user study. PhysGen’s resulting videos can be used for various downstream applications, such as turning an image into a realistic animation or allowing users to interact with the image and create various dynamics.
S. Gupta and S. Wang—Equal advising.
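To make the pipeline concrete, the following is a minimal, hypothetical sketch of the kind of image-space rigid-body simulation performed by component (ii), written with the open-source 2D physics engine Pymunk. It is not the authors' implementation: the object geometry, mass, friction, and elasticity are assumed to have already been inferred by the image understanding module, and all numeric values are placeholders.

import pymunk

# Hypothetical sketch: simulate one segmented object as a 2D rigid body.
# Geometry and physical parameters are assumed to come from an upstream
# image-understanding step; the values below are placeholders.

space = pymunk.Space()
space.gravity = (0.0, -981.0)           # pixels / s^2, y pointing up

# Static ground so the object has something to collide with.
ground = pymunk.Segment(space.static_body, (0, 0), (640, 0), 2.0)
ground.friction = 0.8
space.add(ground)

# Dynamic body approximating the segmented object with its bounding box.
body = pymunk.Body()
body.position = (120.0, 200.0)          # object centroid in image coordinates
shape = pymunk.Poly.create_box(body, size=(60.0, 40.0))
shape.mass = 1.0                        # inferred mass (placeholder)
shape.friction = 0.5                    # inferred friction (placeholder)
shape.elasticity = 0.4                  # inferred restitution (placeholder)
space.add(body, shape)

# User-specified input condition: an impulse applied at the object's center.
body.apply_impulse_at_local_point((300.0, 0.0), (0.0, 0.0))

# Step the simulation and record the per-frame pose (translation, rotation).
dt, fps, seconds = 1.0 / 240.0, 30, 2
trajectory = []
for frame in range(fps * seconds):
    for _ in range(int(1.0 / (fps * dt))):   # 8 substeps per output frame
        space.step(dt)
    trajectory.append((tuple(body.position), body.angle))

print(trajectory[:3])

In this sketch, the recorded per-frame poses play the role of the simulated motion that the image-based rendering and refinement module (component (iii)) would turn into realistic video via generative video diffusion.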
Acknowledgements
This project is supported by NSF Awards #2331878, #2340254, and #2312102, the IBM IIDAI Grant, and an Intel Research Gift. We greatly appreciate the NCSA for providing computing resources. We thank Tianhang Cheng for helpful discussions. We thank Emily Chen and Gloria Wang for proofreading.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, S., Ren, Z., Gupta, S., Wang, S. (2025). PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15140. Springer, Cham. https://doi.org/10.1007/978-3-031-73007-8_21
DOI: https://doi.org/10.1007/978-3-031-73007-8_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73006-1
Online ISBN: 978-3-031-73007-8
eBook Packages: Computer Science; Computer Science (R0)