JoJoGAN: One Shot Face Stylization

  • Conference paper
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13676)


Abstract

A style mapper applies some fixed style to its input images (so, for example, taking faces to cartoons). This paper describes a simple procedure – JoJoGAN – to learn a style mapper from a single example of the style. JoJoGAN uses a GAN inversion procedure and StyleGAN’s style-mixing property to produce a substantial paired dataset from a single example style. The paired dataset is then used to fine-tune a StyleGAN. An image can then be style mapped by GAN inversion followed by the fine-tuned StyleGAN. JoJoGAN needs just one reference and as little as 30 s of training time. JoJoGAN can use extreme style references (say, animal faces) successfully. Furthermore, one can control what aspects of the style are used and how much of the style is applied. Qualitative and quantitative evaluation show that JoJoGAN produces high-quality, high-resolution images that vastly outperform the current state-of-the-art.

References

  1. Alaluf, Y., Patashnik, O., Cohen-Or, D.: ReStyle: a residual-based StyleGAN encoder via iterative refinement. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021

  2. Chong, M.J., Chu, W.S., Kumar, A., Forsyth, D.: Retrieve in style: unsupervised facial feature transfer and retrieval. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3887–3896, October 2021

  3. Chong, M.J., Forsyth, D.: Effectively unbiased FID and Inception Score and where to find them. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6070–6079 (2020)

  4. Chong, M.J., Forsyth, D.: GANs N’ Roses: stable, controllable, diverse image to image translation (works for videos too!) (2021)

  5. Chong, M.J., Lee, H.Y., Forsyth, D.: StyleGAN of all trades: image manipulation with only pretrained StyleGAN. arXiv preprint arXiv:2111.01619 (2021)

  6. Deng, J., Guo, J., Xue, N., Zafeiriou, S.: ArcFace: additive angular margin loss for deep face recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4690–4699 (2019)

  7. Efros, A.A., Freeman, W.T.: Image quilting for texture synthesis and transfer. In: Proceedings of SIGGRAPH 2001, pp. 341–346 (2001)

  8. Gal, R., Patashnik, O., Maron, H., Chechik, G., Cohen-Or, D.: StyleGAN-NADA: CLIP-guided domain adaptation of image generators (2021)

  9. Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016

  10. Hertzmann, A., Jacobs, C.E., Oliver, N., Curless, B., Salesin, D.H.: Image analogies. In: Proceedings of SIGGRAPH 2001, pp. 327–340 (2001)

  11. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Advances in Neural Information Processing Systems, vol. 30 (2017)

  12. Huang, X., Belongie, S.: Arbitrary style transfer in real-time with adaptive instance normalization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1501–1510 (2017)

  13. Huang, X., Liu, M.Y., Belongie, S., Kautz, J.: Multimodal unsupervised image-to-image translation. In: ECCV (2018)

  14. Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: European Conference on Computer Vision (2016)

  15. Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401–4410 (2019)

  16. Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing and improving the image quality of StyleGAN. In: Proceedings of CVPR (2020)

  17. Kim, S.S.Y., Kolkin, N., Salavon, J., Shakhnarovich, G.: Deformable style transfer. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12371, pp. 246–261. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58574-7_15

  18. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. CoRR abs/1412.6980 (2015)

  19. Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., Yang, M.H.: Universal style transfer via feature transforms. In: Advances in Neural Information Processing Systems (2017)

  20. Li, Y., Zhang, R., Lu, J.C., Shechtman, E.: Few-shot image generation with elastic weight consolidation. In: Advances in Neural Information Processing Systems (2020)

  21. Liu, M., Li, Q., Qin, Z., Zhang, G., Wan, P., Zheng, W.: BlendGAN: implicitly GAN blending for arbitrary stylized face generation. In: Advances in Neural Information Processing Systems (2021)

  22. Menon, S., Damian, A., Hu, S., Ravi, N., Rudin, C.: PULSE: self-supervised photo upsampling via latent space exploration of generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2437–2445 (2020)

  23. Mo, S., Cho, M., Shin, J.: Freeze the discriminator: a simple baseline for fine-tuning GANs. In: CVPR AI for Content Creation Workshop (2020)

  24. Ojha, U., et al.: Few-shot image generation via cross-domain correspondence. In: CVPR (2021)

  25. Ojha, U., et al.: Few-shot image generation via cross-domain correspondence. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10743–10752 (2021)

  26. Park, D.Y., Lee, K.H.: Arbitrary style transfer with style-attentional networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5880–5888 (2019)

  27. Pinkney, J.N., Adler, D.: Resolution dependent GAN interpolation for controllable image synthesis between domains. arXiv preprint arXiv:2010.05334 (2020)

  28. Radford, A., et al.: Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020 (2021)

  29. Robb, E., Chu, W.S., Kumar, A., Huang, J.B.: Few-shot adaptation of generative adversarial networks. arXiv preprint arXiv:2010.11943 (2020)

  30. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. Adv. Neural Inf. Process. Syst. 29, 2234–2242 (2016)

  31. Shen, Y., Yang, C., Tang, X., Zhou, B.: InterFaceGAN: interpreting the disentangled face representation learned by GANs. IEEE Trans. Pattern Anal. Mach. Intell. (2020)

  32. Shen, Y., Zhou, B.: Closed-form factorization of latent semantics in GANs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1532–1540 (2021)

  33. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (2015)

  34. Tov, O., Alaluf, Y., Nitzan, Y., Patashnik, O., Cohen-Or, D.: Designing an encoder for StyleGAN image manipulation. arXiv preprint arXiv:2102.02766 (2021)

  35. Wang, X., Li, Y., Zhang, H., Shan, Y.: Towards real-world blind face restoration with generative facial prior. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9168–9178 (2021)

  36. Wu, Z., Lischinski, D., Shechtman, E.: StyleSpace analysis: disentangled controls for StyleGAN image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12863–12872 (2021)

  37. Yang, T., Ren, P., Xie, X., Zhang, L.: GAN prior embedded network for blind face restoration in the wild. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2021)

  38. Yeh, M.C., Tang, S., Bhattad, A., Zou, C., Forsyth, D.: Improving style transfer with calibrated metrics. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), March 2020

  39. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR (2018)

  40. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: 2017 IEEE International Conference on Computer Vision (ICCV) (2017)

  41. Zhu, P., Abdal, R., Femiani, J., Wonka, P.: Mind the gap: domain gap control for single shot domain adaptation for generative adversarial networks. In: International Conference on Learning Representations (2022). https://openreview.net/forum?id=vqGi8Kp0wM

  42. Zhu, P., Abdal, R., Qin, Y., Femiani, J., Wonka, P.: Improved StyleGAN embedding: where are the good latents? arXiv preprint arXiv:2012.09036 (2020)

Author information

Correspondence to Min Jin Chong.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 3656 KB)

A Appendix

A.1 Choice of GAN Inversion

Fig. 14. The choice of GAN inversion matters. We compare JoJoGAN trained on e4e [34], II2S [42], and ReStyle [1] inversions. II2S gives the most realistic inversions, leading to stylizations that preserve the shapes and proportions of the reference. ReStyle gives the most accurate reconstructions, leading to stylizations that better preserve the features and proportions of the input.

JoJoGAN relies on GAN inversion to create a paired dataset. We investigate the effect of three different GAN inversion methods, e4e [34], II2S [42], and ReStyle [1], in Fig. 14.

Using e4e fails to accurately recreate the style reference, conveniently giving us a corresponding real face. ReStyle, on the other hand, inverts the reference more accurately, giving a non-realistic face. II2S is a gradient-descent-based method whose regularization term maps the style code to a higher-density region of the latent space. The regularization results in very realistic faces that are somewhat inaccurate to the reference.
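For concreteness, the optimization-based flavor of inversion (in the spirit of II2S) can be sketched as follows. The generator interface (`mean_latent`, `synthesis`) and the regularizer weight are illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn.functional as F

def invert(G, target, steps=500, lr=0.01, reg_weight=0.5):
    # Optimize a W+ code so the generator reproduces `target`, while a
    # regularizer pulls the code toward the mean latent (a high-density region).
    w_mean = G.mean_latent()                    # (1, num_layers, 512); assumed helper
    w = w_mean.clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        recon = G.synthesis(w)                  # image rendered from the current code
        rec_loss = F.mse_loss(recon, target)    # pixel reconstruction term
        reg_loss = ((w - w_mean) ** 2).mean()   # keeps the inversion realistic
        loss = rec_loss + reg_weight * reg_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()
```

A larger `reg_weight` trades reconstruction accuracy for realism, which is exactly the II2S-versus-ReStyle trade-off discussed next.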

The different inversions give different JoJoGAN results. Training with ReStyle leads to clean stylizations that accurately preserve the features and proportions of the input face. Training with II2S, on the other hand, leads to heavy stylization that borrows shapes and proportions from the reference. However, it also introduces substantial semantic changes from the input face and artifacts (note the change of identity and the artifacts along the neck).

Fig. 15. The choice of M matters. M controls the blend between the inverted style code and the mean style code. M1 is closest to the reference, leading to smaller features (e.g., eyes). M3 is closest to a real face, leading to exaggerated features more like the reference, but also to significant artifacts.

In practice, we blend the style codes from ReStyle and the mean face. For M, we borrow the style code from the mean face at layers 7, 9, and 11, which transfers the facial features of the mean face to the inversion. However, blending coarsely at the layer level cannot affect only the proportions of the features: naively blending in the mean face can change the expression of the inversion (e.g., from neutral to smiling) or introduce artifacts. We therefore blend at a finer scale, which we do by isolating specific facial features in the style space using RIS [2]. Figure 15 compares the results of using different M for blending. When the blended image is more face-like (M3), the exaggerated features of the reference are transferred, but significant artifacts are introduced (see M3, row 2). By carefully selecting M, we can transfer the exaggerated features while avoiding artifacts (see M2).
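A minimal sketch of the layer-level blend is given below. The per-layer interpolation is the idea; the tensor layout and the `alpha` parameter are illustrative assumptions, and the finer, feature-level masks obtained with RIS [2] are not shown:

```python
import torch

def blend_codes(w_inv, w_mean, layers=(7, 9, 11), alpha=1.0):
    # Blend the inverted style code with the mean-face code at selected layers.
    # w_inv, w_mean: (1, num_layers, 512) W+ codes. alpha=1.0 copies the mean
    # face entirely at those layers; smaller alpha gives a partial blend.
    w = w_inv.clone()
    for i in layers:
        w[:, i] = (1 - alpha) * w_inv[:, i] + alpha * w_mean[:, i]
    return w
```

Blending individual style channels, rather than whole layers, is what lets M change feature proportions without also changing expression.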

A.2 Identity Loss

Before computing the identity loss, we grayscale the input images to prevent the loss from affecting colors. The weight of the identity loss is reference-dependent, but we typically choose a value between \(2 \times 10^3\) and \(5 \times 10^3\).
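A sketch of this grayscaled identity loss is shown below. The face encoder is assumed to return L2-normalized identity embeddings (e.g., an ArcFace-style network [6]); the cosine-distance form is illustrative:

```python
import torch
import torch.nn.functional as F

def to_gray(img):
    # Luminance conversion; repeat to 3 channels so a pretrained encoder still applies.
    gray = 0.299 * img[:, 0:1] + 0.587 * img[:, 1:2] + 0.114 * img[:, 2:3]
    return gray.repeat(1, 3, 1, 1)

def identity_loss(face_encoder, output, target, weight=2e3):
    # Grayscale both images first so the identity term cannot fight the color style.
    e_out = F.normalize(face_encoder(to_gray(output)), dim=-1)
    e_tgt = F.normalize(face_encoder(to_gray(target)), dim=-1)
    return weight * (1 - (e_out * e_tgt).sum(dim=-1)).mean()
```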

A.3 Choice of Style Mixing Space

Fig. 16. We study how the choice of latent space for style mixing affects JoJoGAN. Style mixing in \(\mathcal {S}\) space gives more accurate color reproduction in (a) and (b) and a better stylization effect (note the eyes) in (c).

Style mixing in Eq. (1) allows us to generate more paired datapoints: it is reasonable to map faces with slight differences in textures and colors to the same reference. However, while we style mix to generate different faces, certain features such as identity and face pose must remain the same. We study how the choice of latent space for style mixing affects the stylization. In Fig. 16 we see that style mixing in \(\mathcal {S}\) gives better color reproduction and an overall better stylization effect. This is because \(\mathcal {S}\) is more disentangled [36] and allows us to style mix more aggressively without changing the features we want to keep intact.
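The style-mixing step can be sketched as follows. Both `G.get_styles(z)` (standing for a helper that returns per-layer style vectors in \(\mathcal {S}\)) and the split between kept and mixed layers are assumptions made for illustration:

```python
import torch

def build_paired_styles(G, s_ref, num_samples=16, coarse_layers=7):
    # Grow the paired dataset: keep the reference inversion's coarse styles
    # (identity, pose) and swap in random fine styles (texture, color), so many
    # different inputs are all paired with the single style reference.
    mixed = []
    for _ in range(num_samples):
        z = torch.randn(1, 512)
        s_rand = G.get_styles(z)                 # assumed helper: per-layer S-space styles
        s_new = [v.clone() for v in s_ref]
        for i in range(coarse_layers, len(s_ref)):
            s_new[i] = s_rand[i]                 # randomize only the fine layers
        mixed.append(s_new)
    return mixed
```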

Fig. 17. The choice of training data has an effect. First row: when there is just one example in \(\mathcal {W}\), JoJoGAN transfers relatively little style, likely because it is trained to map “few” images to the stylized example. Second row: same training procedure as in Fig. 8, using \(\mathcal {C}\). Third row: same training procedure as the second row, but with grayscale images for Eq. (2). Fourth row: same training procedure as in Fig. 8, using \(\mathcal {X}\). Fifth row: same training procedure as the fourth row, but with grayscale images for Eq. (2).

A.4 Varying Dataset

Using \(\mathcal {C}\) and \(\mathcal {X}\) gives different stylization effects. Fine-tuning with \(\mathcal {X}\) accurately reproduces the color profile of the reference, while \(\mathcal {C}\) tries to preserve the input color profile. However, this alone is insufficient to fully preserve the colors, as we see in Fig. 17. Grayscaling the images before computing the loss in Eq. (2), in addition to fine-tuning with \(\mathcal {C}\), gives the stylization effects without altering the color profile. Both \(\mathcal {C}\) and grayscaling are necessary to achieve this effect; using \(\mathcal {X}\) with grayscaling is insufficient.
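As an illustration of the grayscaling step, a wrapper along the following lines could be used; the LPIPS call here merely stands in for whatever form Eq. (2) takes, which is an assumption of this sketch:

```python
import torch
import lpips  # pip install lpips; used only as a stand-in perceptual loss

perceptual = lpips.LPIPS(net='vgg')

def to_gray(img):
    gray = 0.299 * img[:, 0:1] + 0.587 * img[:, 1:2] + 0.114 * img[:, 2:3]
    return gray.repeat(1, 3, 1, 1)               # keep 3 channels for the pretrained net

def grayscale_style_loss(output, reference):
    # Compare stylized output and reference in grayscale so fine-tuning is not
    # pushed to copy the reference's color profile onto the input.
    return perceptual(to_gray(output), to_gray(reference)).mean()
```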

A.5 Feature Matching Loss

For the discriminator feature-matching loss, we use the intermediate activations after resblocks 2, 4, 5, and 6 (Figs. 18, 20 and 21).
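A sketch of the feature-matching term is given below; `D.forward_features`, returning activations keyed by resblock index, is an assumed interface rather than the stock StyleGAN2 discriminator API:

```python
import torch
import torch.nn.functional as F

def feature_matching_loss(D, generated, reference, taps=(2, 4, 5, 6)):
    # Match intermediate discriminator activations after the listed resblocks,
    # rather than (or in addition to) the final real/fake score.
    feats_g = D.forward_features(generated)      # dict: resblock index -> activation
    feats_r = D.forward_features(reference)
    return sum(F.l1_loss(feats_g[i], feats_r[i]) for i in taps)
```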

Fig. 18. More multi-shot examples.

Fig. 19. JoJoGAN produces unsatisfactory style transfers on out-of-distribution cases, producing human-animal hybrids.

Fig. 20. We compare with Zhu et al. [41] on all reference examples used in their paper and described there as hard cases. For each reference, the top row is JoJoGAN and the second row is Zhu et al. Note how their method distorts chin shape, while JoJoGAN produces strong outputs.

Fig. 21. JoJoGAN is a method for benefiting from what a StyleGAN knows, and so should apply to other domains where a well-trained StyleGAN is available. Here we demonstrate JoJoGAN applied to LSUN-Churches.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Chong, M.J., Forsyth, D. (2022). JoJoGAN: One Shot Face Stylization. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13676. Springer, Cham. https://doi.org/10.1007/978-3-031-19787-1_8


  • DOI: https://doi.org/10.1007/978-3-031-19787-1_8


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19786-4

  • Online ISBN: 978-3-031-19787-1

  • eBook Packages: Computer Science (R0)
