Abstract
Generative adversarial networks (GANs) have demonstrated impressive image generation quality and semantic editing capabilities for real images, e.g., changing object classes, modifying attributes, or transferring styles. However, applying these GAN-based editing techniques to a video independently for each frame inevitably results in temporal flickering artifacts. We present a simple yet effective method that facilitates temporally coherent video editing. Our core idea is to minimize the temporal photometric inconsistency by optimizing both the latent code and the pre-trained generator. We evaluate the quality of our editing across different domains and GAN inversion techniques and show favorable results against the baselines.
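To make the core idea concrete, below is a minimal sketch (not the authors' implementation) of what "optimizing both the latent code and the pre-trained generator" against a temporal photometric loss could look like. It assumes a StyleGAN-like generator `G(w) -> (1,C,H,W)` image, per-frame edited latent codes from any GAN-inversion and editing pipeline, and optical flow precomputed by an off-the-shelf estimator such as RAFT; the names `warp`, `flows`, and `finetune_for_consistency` are illustrative placeholders.

```python
# Hedged sketch: per-video fine-tuning for temporally consistent GAN editing.
import torch
import torch.nn.functional as F


def warp(img: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp `img` (N,C,H,W) by `flow` (N,2,H,W), flow in pixels."""
    n, _, h, w = img.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=img.device, dtype=img.dtype),
        torch.arange(w, device=img.device, dtype=img.dtype),
        indexing="ij",
    )
    base = torch.stack((xs, ys)).unsqueeze(0).expand(n, -1, -1, -1)
    coords = base + flow
    # Normalize pixel coordinates to [-1, 1] as expected by grid_sample.
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    return F.grid_sample(img, torch.stack((gx, gy), dim=-1),
                         align_corners=True)


def finetune_for_consistency(G, latents, flows, steps=200, lr=1e-3,
                             fidelity_weight=1.0):
    """Optimize `latents` (T, latent_dim) AND G's weights so consecutive
    generated frames agree photometrically under `flows` (a list of T-1
    forward flows, frame t -> t+1, each of shape (1,2,H,W))."""
    # Per-frame edited results before fine-tuning; staying close to these
    # keeps the optimization from erasing the semantic edit itself.
    with torch.no_grad():
        targets = [G(latents[t:t + 1]) for t in range(latents.shape[0])]
    latents = torch.nn.Parameter(latents.clone())
    opt = torch.optim.Adam([latents, *G.parameters()], lr=lr)
    for _ in range(steps):
        frames = [G(latents[t:t + 1]) for t in range(latents.shape[0])]
        loss = fidelity_weight * sum(
            F.l1_loss(f, tgt) for f, tgt in zip(frames, targets))
        for t in range(len(frames) - 1):
            # Warp frame t+1 back to frame t; penalize photometric error.
            loss = loss + F.l1_loss(warp(frames[t + 1], flows[t]), frames[t])
        opt.zero_grad()
        loss.backward()
        opt.step()
    return latents.detach(), G
```

Occlusion masking of warped regions and a perceptual loss (e.g., LPIPS) are common refinements omitted here for brevity. The design point the sketch illustrates is that fine-tuning the generator weights, rather than the latent codes alone, provides enough capacity to suppress flicker that latent optimization by itself cannot remove.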
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Xu, Y., AlBahar, B., Huang, JB. (2022). Temporally Consistent Semantic Video Editing. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13675. Springer, Cham. https://doi.org/10.1007/978-3-031-19784-0_21
DOI: https://doi.org/10.1007/978-3-031-19784-0_21
Print ISBN: 978-3-031-19783-3
Online ISBN: 978-3-031-19784-0