Video Editing via Factorized Diffusion Distillation

  • Conference paper
  • Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15134)

Abstract

We introduce Emu Video Edit (EVE), a model that establishes a new state-of-the-art in video editing without relying on any supervised video editing data. To develop EVE, we separately train an image editing adapter and a video generation adapter, and attach both to the same text-to-image model. Then, to align the adapters towards video editing, we introduce a new unsupervised distillation procedure, Factorized Diffusion Distillation. This procedure distills knowledge from one or more teachers simultaneously, without any supervised data. We utilize this procedure to teach EVE to edit videos by jointly distilling knowledge to (i) precisely edit each individual frame from the image editing adapter, and (ii) ensure temporal consistency among the edited frames using the video generation adapter. Finally, to demonstrate the potential of our approach in unlocking other capabilities, we align additional combinations of adapters.

U. Singer and A. Zohar—Equal contribution.
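
The factorized procedure described in the abstract can be sketched compactly. The toy PyTorch example below is only an illustration of the idea, not EVE's implementation: the module classes, the MSE-style distillation losses, the loss weights, and the feature shapes are all assumptions made for brevity (the actual method uses diffusion adapters and different distillation objectives). A frozen image-editing teacher supervises each frame independently, a frozen video-generation teacher supervises the clip as a whole, and the student is updated against both at once, without any supervised video-editing data.

# Toy sketch of Factorized Diffusion Distillation (FDD): a student is aligned
# for video editing by distilling from two frozen teachers simultaneously.
# All names, losses, and shapes here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyAdapter(nn.Module):
    # Stand-in for an adapter attached to a shared text-to-image backbone.
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.net(x)

def fdd_step(student, image_teacher, video_teacher, frames, opt,
             w_edit=1.0, w_temporal=1.0):
    # frames: (B, T, D) toy "video" features.
    # (i) the image-editing teacher scores every frame independently;
    # (ii) the video-generation teacher scores the sequence as a whole.
    b, t, d = frames.shape
    edited = student(frames.reshape(b * t, d)).reshape(b, t, d)
    with torch.no_grad():  # both teachers stay frozen
        per_frame_target = image_teacher(frames.reshape(b * t, d)).reshape(b, t, d)
        temporal_target = video_teacher(frames.mean(dim=1))  # crude clip summary
    edit_loss = F.mse_loss(edited, per_frame_target)
    temporal_loss = F.mse_loss(edited.mean(dim=1), temporal_target)
    loss = w_edit * edit_loss + w_temporal * temporal_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

student = TinyAdapter()
image_teacher, video_teacher = TinyAdapter(), TinyAdapter()
for p in list(image_teacher.parameters()) + list(video_teacher.parameters()):
    p.requires_grad_(False)
opt = torch.optim.AdamW(student.parameters(), lr=1e-3)
frames = torch.randn(2, 8, 64)  # 2 clips of 8 frames each, 64-dim features
print(fdd_step(student, image_teacher, video_teacher, frames, opt))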

Notes

  1. To enable the editing adapter to process videos, we stack the frames independently as a batch (see the sketch after this list).

  2. https://fdd-video-edit.github.io/.
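
The batching trick in note 1 amounts to folding the time dimension into the batch dimension before running the image-editing adapter, then restoring it afterwards. The shapes and placeholder names below are illustrative assumptions, not the paper's code.

# Hedged illustration of note 1: an image editor that expects (N, C, H, W)
# inputs can process a video by stacking the frames independently as a batch.
import torch

video = torch.randn(2, 16, 3, 64, 64)            # (B, T, C, H, W): 2 clips, 16 frames
b, t, c, h, w = video.shape

frames_as_batch = video.reshape(b * t, c, h, w)  # fold time into the batch dimension
edited_frames = frames_as_batch                  # placeholder for the adapter's output

edited_video = edited_frames.reshape(b, t, c, h, w)  # restore the time dimension
print(edited_video.shape)                        # torch.Size([2, 16, 3, 64, 64])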


Acknowledgements

We thank Andrew Brown, Bichen Wu, Ishan Misra, Saketh Rambhatla, Xiaoliang Dai, and Zijian He for their contributions.

Author information

Corresponding author

Correspondence to Yuval Kirstain.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Singer, U. et al. (2025). Video Editing via Factorized Diffusion Distillation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15134. Springer, Cham. https://doi.org/10.1007/978-3-031-73116-7_26

  • DOI: https://doi.org/10.1007/978-3-031-73116-7_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73115-0

  • Online ISBN: 978-3-031-73116-7

  • eBook Packages: Computer Science, Computer Science (R0)
