Abstract
Diffusion models have unlocked unprecedented capabilities in image generation, while their video counterparts still lag behind due to the excessive training cost of temporal modeling. Beyond the training burden, generated videos also suffer from inconsistent appearance and structural flickering. To tackle these challenges, we design COLORSHOP, an optimization-free, fine-tuning-free framework that edits the color of objects in a video by exploiting the continuity of the VAE latent space. When processing each frame, we introduce Foreground diffusion to accelerate the operation, and during generation we further propose Cross-Frame spatial feature fusion to enhance foreground continuity across frames. Experiments show that, combined with popular diffusion-based image editing algorithms, COLORSHOP succeeds at video editing tasks, delivering strong performance in terms of consistency and quality.
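To make the abstract's pipeline concrete, the sketch below gives our own minimal reading of it, not the authors' released code: each frame's foreground is recolored in the VAE latent space, and the foreground latents are blended with the previous frame to suppress flicker. The use of diffusers' AutoencoderKL, the helper `color_edit`, and the blend weight `alpha` are assumptions for illustration only.

```python
# Minimal sketch of the idea described in the abstract; this is our own
# illustration, not the authors' implementation. Assumes diffusers'
# AutoencoderKL as the VAE; `color_edit` (any latent-space colour
# manipulation) and the blend weight `alpha` are hypothetical.
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

@torch.no_grad()
def recolor_video(frames, masks, color_edit, alpha=0.7):
    """frames: list of (1, 3, H, W) tensors in [-1, 1];
    masks: list of (1, 1, H, W) foreground masks in {0, 1}."""
    edited, prev = [], None
    for frame, mask in zip(frames, masks):
        z = vae.encode(frame).latent_dist.mode()    # (1, 4, H/8, W/8)
        m = F.interpolate(mask, size=z.shape[-2:])  # mask at latent scale
        z = m * color_edit(z) + (1 - m) * z         # edit foreground only
        if prev is not None:
            # cross-frame fusion: pull the foreground latents toward the
            # previous frame to stabilise appearance over time
            z = z + (1 - alpha) * m * (prev - z)
        prev = z
        edited.append(vae.decode(z).sample)
    return edited
```

Doing both the edit and the blend in latent rather than pixel space is where the abstract's appeal to VAE continuity would pay off: small latent changes decode to smooth appearance changes, so frame-to-frame blending avoids ghosting.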
H. Huang, L. Huang, and L. Xu contributed equally to this paper.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (62376286).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Huang, H., Huang, L., Xu, L., Wu, L., Zhang, X. (2024). COLORSHOP: Color Manipulation of Objects in Videos Using Diffusion Models. In: Huang, D.-S., Pan, Y., Zhang, Q. (eds.) Advanced Intelligent Computing Technology and Applications. ICIC 2024. Lecture Notes in Computer Science, vol. 14872. Springer, Singapore. https://doi.org/10.1007/978-981-97-5612-4_29
DOI: https://doi.org/10.1007/978-981-97-5612-4_29
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-5611-7
Online ISBN: 978-981-97-5612-4
eBook Packages: Computer Science, Computer Science (R0)