Abstract
We introduce Emu Video Edit (EVE), a model that establishes a new state-of-the-art in video editing without relying on any supervised video editing data. To develop EVE we separately train an image editing adapter and a video generation adapter, and attach both to the same text-to-image model. Then, to align the adapters towards video editing, we introduce a new unsupervised distillation procedure, Factorized Diffusion Distillation. This procedure distills knowledge from one or more teachers simultaneously, without any supervised data. We utilize this procedure to teach EVE to edit videos by jointly distilling knowledge to (i) precisely edit each individual frame from the image editing adapter, and (ii) ensure temporal consistency among the edited frames using the video generation adapter. Finally, to demonstrate the potential of our approach in unlocking other capabilities, we align additional combinations of adapters.
U. Singer and A. Zohar—Equal contribution.
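The factorized setup lends itself to a compact description: a single trainable student (the shared text-to-image backbone with both adapters attached) is optimized against two frozen teachers at once, with no edited ground-truth videos involved. The snippet below is a minimal, self-contained sketch of that structure only; the toy modules, tensor shapes, and plain MSE matching terms are illustrative assumptions standing in for the actual adapters and distillation losses of Factorized Diffusion Distillation, which are not reproduced here.

```python
# Minimal sketch of the abstract-level idea: a student (shared backbone plus an
# image-editing adapter and a video-generation adapter) is aligned by distilling
# from two frozen teachers simultaneously, without supervised video editing data.
# All modules, shapes, and the simple MSE-style losses below are assumptions.
import torch
import torch.nn as nn

B, T, C, H, W = 2, 8, 4, 32, 32  # assumed latent-video shape: batch, frames, channels, height, width

class ToyDenoiser(nn.Module):
    """Stand-in for the shared text-to-image backbone (with adapters attached)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(C, C, 3, padding=1)

    def forward(self, x):  # x: (N, C, H, W)
        return self.net(x)

student = ToyDenoiser()                     # backbone + both adapters (trainable)
image_edit_teacher = ToyDenoiser().eval()   # frozen teacher for per-frame editing
video_teacher = ToyDenoiser().eval()        # frozen teacher for temporal consistency
for p in list(image_edit_teacher.parameters()) + list(video_teacher.parameters()):
    p.requires_grad_(False)

opt = torch.optim.Adam(student.parameters(), lr=1e-4)

video = torch.randn(B, T, C, H, W)          # unedited input video latents (no edited targets needed)
frames = video.view(B * T, C, H, W)         # fold frames into the batch dimension

student_out = student(frames)

with torch.no_grad():
    frame_target = image_edit_teacher(frames)  # (i) per-frame editing signal
    video_target = video_teacher(frames)       # (ii) temporal-consistency signal
                                               #     (a real video teacher would see all T frames jointly)

# Joint, unsupervised distillation objective: match both teachers at once.
loss = nn.functional.mse_loss(student_out, frame_target) \
     + nn.functional.mse_loss(student_out, video_target)

opt.zero_grad()
loss.backward()
opt.step()
print(f"distillation loss: {loss.item():.4f}")
```

The property the sketch preserves is that both teacher signals enter a single joint objective, so per-frame edit fidelity and temporal consistency are optimized together rather than in isolation.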
Notes
1. To enable the editing adapter to process videos, we stack the frames independently as a batch (see the sketch below).
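For concreteness, here is a minimal sketch of that batching trick, assuming latent tensors of shape (batch, frames, channels, height, width); the shapes and the placeholder adapter call are illustrative only.

```python
# Frame-stacking trick from note 1: an image-level editing adapter treats each
# video frame as an independent batch element. Shapes are assumptions.
import torch

B, T, C, H, W = 2, 8, 4, 64, 64             # batch, frames, channels, height, width
video = torch.randn(B, T, C, H, W)

frames = video.reshape(B * T, C, H, W)      # fold time into the batch dimension
# ... run the image editing adapter on `frames` here ...
edited = frames                             # placeholder for the adapter's per-frame output
edited_video = edited.reshape(B, T, C, H, W)  # unfold back into a video

assert edited_video.shape == video.shape
```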
Acknowledgements
We thank Andrew Brown, Bichen Wu, Ishan Misra, Saketh Rambhatla, Xiaoliang Dai, and Zijian He for their contributions.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Singer, U. et al. (2025). Video Editing via Factorized Diffusion Distillation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15134. Springer, Cham. https://doi.org/10.1007/978-3-031-73116-7_26
DOI: https://doi.org/10.1007/978-3-031-73116-7_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73115-0
Online ISBN: 978-3-031-73116-7