Abstract
Generative adversarial networks (GANs) have demonstrated impressive image generation quality and semantic editing capabilities for real images, e.g., changing object classes, modifying attributes, or transferring styles. However, applying these GAN-based editing techniques to a video independently for each frame inevitably results in temporal flickering artifacts. We present a simple yet effective method that facilitates temporally coherent video editing. Our core idea is to minimize the temporal photometric inconsistency by optimizing both the latent code and the pre-trained generator. We evaluate the quality of our editing across different domains and GAN inversion techniques and show favorable results against the baselines.
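To make the core idea concrete, below is a minimal sketch (not the authors' implementation) of what "optimizing both the latent code and the pre-trained generator" against a temporal photometric loss could look like. It assumes a StyleGAN-like generator `G(w) -> (1,C,H,W)` image, per-frame edited latent codes from any GAN-inversion and editing pipeline, and optical flow precomputed by an off-the-shelf estimator such as RAFT; the names `warp`, `flows`, and `finetune_for_consistency` are illustrative placeholders.

```python
# Hedged sketch: per-video fine-tuning for temporally consistent GAN editing.
import torch
import torch.nn.functional as F


def warp(img: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp `img` (N,C,H,W) by `flow` (N,2,H,W), flow in pixels."""
    n, _, h, w = img.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=img.device, dtype=img.dtype),
        torch.arange(w, device=img.device, dtype=img.dtype),
        indexing="ij",
    )
    base = torch.stack((xs, ys)).unsqueeze(0).expand(n, -1, -1, -1)
    coords = base + flow
    # Normalize pixel coordinates to [-1, 1] as expected by grid_sample.
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    return F.grid_sample(img, torch.stack((gx, gy), dim=-1),
                         align_corners=True)


def finetune_for_consistency(G, latents, flows, steps=200, lr=1e-3,
                             fidelity_weight=1.0):
    """Optimize `latents` (T, latent_dim) AND G's weights so consecutive
    generated frames agree photometrically under `flows` (a list of T-1
    forward flows, frame t -> t+1, each of shape (1,2,H,W))."""
    # Per-frame edited results before fine-tuning; staying close to these
    # keeps the optimization from erasing the semantic edit itself.
    with torch.no_grad():
        targets = [G(latents[t:t + 1]) for t in range(latents.shape[0])]
    latents = torch.nn.Parameter(latents.clone())
    opt = torch.optim.Adam([latents, *G.parameters()], lr=lr)
    for _ in range(steps):
        frames = [G(latents[t:t + 1]) for t in range(latents.shape[0])]
        loss = fidelity_weight * sum(
            F.l1_loss(f, tgt) for f, tgt in zip(frames, targets))
        for t in range(len(frames) - 1):
            # Warp frame t+1 back to frame t; penalize photometric error.
            loss = loss + F.l1_loss(warp(frames[t + 1], flows[t]), frames[t])
        opt.zero_grad()
        loss.backward()
        opt.step()
    return latents.detach(), G
```

Occlusion masking of warped regions and a perceptual loss (e.g., LPIPS) are common refinements omitted here for brevity. The design point the sketch illustrates is that fine-tuning the generator weights, rather than the latent codes alone, provides enough capacity to suppress flicker that latent optimization by itself cannot remove.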
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Xu, Y., AlBahar, B., Huang, JB. (2022). Temporally Consistent Semantic Video Editing. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13675. Springer, Cham. https://doi.org/10.1007/978-3-031-19784-0_21
DOI: https://doi.org/10.1007/978-3-031-19784-0_21
Print ISBN: 978-3-031-19783-3
Online ISBN: 978-3-031-19784-0