Abstract
In recent years, diffusion models have emerged as a powerful approach to image synthesis. However, applying these models directly to video synthesis is challenging and often leads to noticeable flickering in the generated content. Although recently proposed zero-shot methods can alleviate flickering to some extent, generating coherent videos remains difficult. In this paper, we propose DiffSynth, a novel approach that converts image synthesis pipelines into video synthesis pipelines. DiffSynth consists of two key components: a latent in-iteration deflickering framework and a video deflickering algorithm. The latent in-iteration deflickering framework applies video deflickering in the latent space of diffusion models, effectively preventing flicker from accumulating across intermediate steps. Additionally, we introduce a video deflickering algorithm, named the patch blending algorithm, which remaps objects across different frames and blends them to enhance video consistency. A notable advantage of DiffSynth is its general applicability to various video synthesis tasks, including text-guided video stylization, fashion video synthesis, image-guided video stylization, video restoration, and 3D rendering. For text-guided video stylization, DiffSynth makes it possible to synthesize high-quality videos without cherry-picking. The experimental results demonstrate the effectiveness of DiffSynth, and we further showcase its practical value on the Alibaba e-commerce platform.
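To make the two components concrete, the sketch below shows, under stated assumptions, how an in-iteration deflickering step could slot into a generic diffusion sampling loop over per-frame latents. It is a minimal conceptual sketch, not the authors' implementation: `denoise_step` and `blend_with_neighbors` are hypothetical placeholders standing in for the diffusion model update and the patch blending algorithm (which remaps objects across frames before blending; a simple temporal moving average stands in here).

```python
# Minimal conceptual sketch, not the paper's code: a generic reverse-diffusion
# loop over per-frame latents with a deflickering step inside each iteration.
# `denoise_step` and `blend_with_neighbors` are hypothetical placeholders.
import numpy as np

def denoise_step(latents: np.ndarray, t: int) -> np.ndarray:
    # Placeholder for one reverse-diffusion update of all frame latents
    # (in practice, a U-Net-based diffusion model would be called here).
    return latents * 0.98  # dummy update so the sketch runs end to end

def blend_with_neighbors(latents: np.ndarray, weight: float = 0.25) -> np.ndarray:
    # Placeholder deflickering: blend each frame's latent with its temporal
    # neighbors. The paper's patch blending algorithm remaps objects across
    # frames before blending; a simple moving average stands in here.
    smoothed = latents.copy()
    smoothed[1:-1] = (1 - 2 * weight) * latents[1:-1] + weight * (latents[:-2] + latents[2:])
    return smoothed

def sample_video_latents(num_frames=8, latent_shape=(4, 64, 64), num_steps=50, seed=0):
    rng = np.random.default_rng(seed)
    latents = rng.standard_normal((num_frames, *latent_shape)).astype(np.float32)
    for t in reversed(range(num_steps)):
        latents = denoise_step(latents, t)
        # In-iteration deflickering: smooth across frames in latent space at
        # every step, so flicker cannot accumulate over the iterations.
        latents = blend_with_neighbors(latents)
    return latents  # decode with the VAE afterwards to obtain video frames

if __name__ == "__main__":
    print(sample_video_latents().shape)  # (8, 4, 64, 64)
```

Applying the smoothing inside every iteration, rather than once on the decoded frames, reflects the abstract's point that flicker is suppressed before it can accumulate over the denoising steps.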
Z. Wu—Individual Researcher.
Acknowledgments
This work was supported by the National Natural Science Foundation of China under grant number 62202170, the Fundamental Research Funds for the Central Universities under grant number YBNLTS2023-014, and Alibaba Group through the Alibaba Innovation Research Program.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Duan, Z. et al. (2024). DiffSynth: Latent In-Iteration Deflickering for Realistic Video Synthesis. In: Bifet, A., Krilavičius, T., Miliou, I., Nowaczyk, S. (eds) Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track. ECML PKDD 2024. Lecture Notes in Computer Science, vol. 14950. Springer, Cham. https://doi.org/10.1007/978-3-031-70381-2_21
DOI: https://doi.org/10.1007/978-3-031-70381-2_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70380-5
Online ISBN: 978-3-031-70381-2