DiffSynth: Latent In-Iteration Deflickering for Realistic Video Synthesis

  • Conference paper
  • In: Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track (ECML PKDD 2024)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14950)

Abstract

In recent years, diffusion models have emerged as a powerful approach to image synthesis. However, applying these models directly to video synthesis is challenging and often leads to noticeable flickering in the generated content. Although recently proposed zero-shot methods can alleviate flickering to some extent, producing coherent videos remains difficult. In this paper, we propose DiffSynth, a novel approach that converts image synthesis pipelines into video synthesis pipelines. DiffSynth consists of two key components: a latent in-iteration deflickering framework and a video deflickering algorithm. The latent in-iteration deflickering framework applies video deflickering in the latent space of diffusion models, effectively preventing flicker from accumulating across intermediate steps. Additionally, we introduce a video deflickering algorithm, named the patch blending algorithm, which remaps objects across different frames and blends them to enhance video consistency. A notable advantage of DiffSynth is its general applicability to various video synthesis tasks, including text-guided video stylization, fashion video synthesis, image-guided video stylization, video restoration, and 3D rendering. In the task of text-guided video stylization, we make it possible to synthesize high-quality videos without cherry-picking. The experimental results demonstrate the effectiveness of DiffSynth, and we further showcase its practical value on the Alibaba e-commerce platform.
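To make the control flow described in the abstract concrete, the Python sketch below illustrates where in-iteration deflickering sits in a diffusion sampling loop, under heavily simplified assumptions: toy_denoiser stands in for the diffusion U-Net, cross_frame_blend is a naive neighbor-averaging placeholder for the paper's patch blending algorithm (the actual algorithm remaps patches across frames before blending), and the noise schedule is a toy DDIM-like schedule. None of this is the authors' implementation; the official code is linked in footnote 1 below.

```python
# Illustrative sketch only: the real DiffSynth operates on Stable Diffusion
# latents with the patch blending algorithm; the denoiser, blending rule,
# and noise schedule here are simplified placeholders.
import numpy as np

def toy_denoiser(latents, step):
    """Stand-in for a diffusion U-Net: predicts the noise in each frame latent."""
    rng = np.random.default_rng(step)
    return 0.1 * rng.standard_normal(latents.shape)

def cross_frame_blend(frames, weight=0.5):
    """Naive placeholder for the patch blending algorithm: blends each frame
    latent with the average of its temporal neighbors."""
    blended = frames.copy()
    for i in range(len(frames)):
        neighbors = [frames[j] for j in (i - 1, i + 1) if 0 <= j < len(frames)]
        if neighbors:
            blended[i] = (1 - weight) * frames[i] + weight * np.mean(neighbors, axis=0)
    return blended

def sample_video_latents(num_frames=8, latent_shape=(4, 64, 64), num_steps=20):
    """DDIM-style loop with deflickering applied to the estimated clean latents
    at every step, i.e. inside the iteration rather than after sampling."""
    alphas = np.linspace(0.01, 0.999, num_steps)  # toy alpha-bar schedule (noisy -> clean)
    latents = np.random.default_rng(0).standard_normal((num_frames, *latent_shape))
    for step, alpha in enumerate(alphas):
        eps = toy_denoiser(latents, step)
        # Estimate the clean latents x0 from the current noisy latents.
        x0 = (latents - np.sqrt(1 - alpha) * eps) / np.sqrt(alpha)
        # Latent in-iteration deflickering: blend x0 across frames now,
        # so flicker cannot accumulate over the remaining denoising steps.
        x0 = cross_frame_blend(x0)
        # Step toward the next (lower) noise level.
        next_alpha = alphas[step + 1] if step + 1 < num_steps else 1.0
        latents = np.sqrt(next_alpha) * x0 + np.sqrt(1 - next_alpha) * eps
    return latents  # in practice, decode with the VAE to obtain video frames

if __name__ == "__main__":
    video_latents = sample_video_latents()
    print(video_latents.shape)  # (8, 4, 64, 64)
```

The point of the sketch is the placement of the deflickering step: it operates on the estimated clean latents inside every sampling iteration, rather than on the decoded frames after sampling finishes.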

Z. Wu is an independent researcher.

Notes

  1. https://github.com/alibaba/EasyNLP/tree/master/diffusion/DiffSynth
  2. https://ecnu-cilab.github.io/DiffSynth.github.io/
  3. https://pixabay.com/
  4. https://github.com/ECNU-CILAB/Pixabay100
  5. https://civitai.com/models/4384/dreamshaper
  6. https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth
  7. https://huggingface.co/lllyasviel/control_v11p_sd15_softedge
  8. https://huggingface.co/lllyasviel/control_v11e_sd15_shuffle
  9. https://huggingface.co/lllyasviel/control_v11f1e_sd15_tile
  10. https://huggingface.co/lllyasviel/control_v11p_sd15_openpose

Acknowledgments

This work was supported by the National Natural Science Foundation of China under grant number 62202170, by the Fundamental Research Funds for the Central Universities under grant number YBNLTS2023-014, and by Alibaba Group through the Alibaba Innovation Research Program.

Author information

Corresponding author

Correspondence to Cen Chen.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Duan, Z., et al. (2024). DiffSynth: Latent In-Iteration Deflickering for Realistic Video Synthesis. In: Bifet, A., Krilavičius, T., Miliou, I., Nowaczyk, S. (eds.) Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track. ECML PKDD 2024. Lecture Notes in Computer Science, vol. 14950. Springer, Cham. https://doi.org/10.1007/978-3-031-70381-2_21

  • DOI: https://doi.org/10.1007/978-3-031-70381-2_21

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70380-5

  • Online ISBN: 978-3-031-70381-2

  • eBook Packages: Computer Science, Computer Science (R0)
