JoJoGAN: One Shot Face Stylization

  • Conference paper
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13676)


Abstract

A style mapper applies some fixed style to its input images (so, for example, taking faces to cartoons). This paper describes a simple procedure – JoJoGAN – to learn a style mapper from a single example of the style. JoJoGAN uses a GAN inversion procedure and StyleGAN’s style-mixing property to produce a substantial paired dataset from a single example style. The paired dataset is then used to fine-tune a StyleGAN. An image can then be style mapped by GAN inversion followed by the fine-tuned StyleGAN. JoJoGAN needs just one reference and as little as 30 s of training time. JoJoGAN can use extreme style references (say, animal faces) successfully. Furthermore, one can control what aspects of the style are used and how much of the style is applied. Qualitative and quantitative evaluation show that JoJoGAN produces high-quality, high-resolution images that vastly outperform the current state-of-the-art.

References

  1. Alaluf, Y., Patashnik, O., Cohen-Or, D.: ReStyle: a residual-based StyleGAN encoder via iterative refinement. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021

  2. Chong, M.J., Chu, W.S., Kumar, A., Forsyth, D.: Retrieve in style: unsupervised facial feature transfer and retrieval. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3887–3896, October 2021

  3. Chong, M.J., Forsyth, D.: Effectively unbiased FID and Inception Score and where to find them. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6070–6079 (2020)

  4. Chong, M.J., Forsyth, D.: GANs N’ Roses: stable, controllable, diverse image to image translation (works for videos too!) (2021)

  5. Chong, M.J., Lee, H.Y., Forsyth, D.: StyleGAN of all trades: image manipulation with only pretrained StyleGAN. arXiv preprint arXiv:2111.01619 (2021)

  6. Deng, J., Guo, J., Xue, N., Zafeiriou, S.: ArcFace: additive angular margin loss for deep face recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4690–4699 (2019)

  7. Efros, A.A., Freeman, W.T.: Image quilting for texture synthesis and transfer. In: Proceedings of SIGGRAPH 2001, pp. 341–346 (2001)

  8. Gal, R., Patashnik, O., Maron, H., Chechik, G., Cohen-Or, D.: StyleGAN-NADA: CLIP-guided domain adaptation of image generators (2021)

  9. Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016

  10. Hertzmann, A., Jacobs, C.E., Oliver, N., Curless, B., Salesin, D.H.: Image analogies. In: Proceedings of SIGGRAPH 2001, pp. 327–340 (2001)

  11. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Advances in Neural Information Processing Systems, vol. 30 (2017)

  12. Huang, X., Belongie, S.: Arbitrary style transfer in real-time with adaptive instance normalization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1501–1510 (2017)

  13. Huang, X., Liu, M.Y., Belongie, S., Kautz, J.: Multimodal unsupervised image-to-image translation. In: ECCV (2018)

  14. Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: European Conference on Computer Vision (2016)

  15. Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401–4410 (2019)

  16. Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing and improving the image quality of StyleGAN. In: Proceedings of CVPR (2020)

  17. Kim, S.S.Y., Kolkin, N., Salavon, J., Shakhnarovich, G.: Deformable style transfer. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12371, pp. 246–261. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58574-7_15

  18. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. CoRR abs/1412.6980 (2015)

  19. Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., Yang, M.H.: Universal style transfer via feature transforms. In: Advances in Neural Information Processing Systems (2017)

  20. Li, Y., Zhang, R., Lu, J.C., Shechtman, E.: Few-shot image generation with elastic weight consolidation. In: Advances in Neural Information Processing Systems (2020)

  21. Liu, M., Li, Q., Qin, Z., Zhang, G., Wan, P., Zheng, W.: BlendGAN: implicitly GAN blending for arbitrary stylized face generation. In: Advances in Neural Information Processing Systems (2021)

  22. Menon, S., Damian, A., Hu, S., Ravi, N., Rudin, C.: PULSE: self-supervised photo upsampling via latent space exploration of generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2437–2445 (2020)

  23. Mo, S., Cho, M., Shin, J.: Freeze the discriminator: a simple baseline for fine-tuning GANs. In: CVPR AI for Content Creation Workshop (2020)

  24. Ojha, U., et al.: Few-shot image generation via cross-domain correspondence. In: CVPR (2021)

  25. Ojha, U., et al.: Few-shot image generation via cross-domain correspondence. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10743–10752 (2021)

  26. Park, D.Y., Lee, K.H.: Arbitrary style transfer with style-attentional networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5880–5888 (2019)

  27. Pinkney, J.N., Adler, D.: Resolution dependent GAN interpolation for controllable image synthesis between domains. arXiv preprint arXiv:2010.05334 (2020)

  28. Radford, A., et al.: Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020 (2021)

  29. Robb, E., Chu, W.S., Kumar, A., Huang, J.B.: Few-shot adaptation of generative adversarial networks. arXiv preprint arXiv:2010.11943 (2020)

  30. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. Adv. Neural Inf. Process. Syst. 29, 2234–2242 (2016)

  31. Shen, Y., Yang, C., Tang, X., Zhou, B.: InterFaceGAN: interpreting the disentangled face representation learned by GANs. IEEE Trans. Pattern Anal. Mach. Intell. (2020)

  32. Shen, Y., Zhou, B.: Closed-form factorization of latent semantics in GANs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1532–1540 (2021)

  33. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (2015)

  34. Tov, O., Alaluf, Y., Nitzan, Y., Patashnik, O., Cohen-Or, D.: Designing an encoder for StyleGAN image manipulation. arXiv preprint arXiv:2102.02766 (2021)

  35. Wang, X., Li, Y., Zhang, H., Shan, Y.: Towards real-world blind face restoration with generative facial prior. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9168–9178 (2021)

  36. Wu, Z., Lischinski, D., Shechtman, E.: StyleSpace analysis: disentangled controls for StyleGAN image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12863–12872 (2021)

  37. Yang, T., Ren, P., Xie, X., Zhang, L.: GAN prior embedded network for blind face restoration in the wild. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2021)

  38. Yeh, M.C., Tang, S., Bhattad, A., Zou, C., Forsyth, D.: Improving style transfer with calibrated metrics. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), March 2020

  39. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR (2018)

  40. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: 2017 IEEE International Conference on Computer Vision (ICCV) (2017)

  41. Zhu, P., Abdal, R., Femiani, J., Wonka, P.: Mind the gap: domain gap control for single shot domain adaptation for generative adversarial networks. In: International Conference on Learning Representations (2022). https://openreview.net/forum?id=vqGi8Kp0wM

  42. Zhu, P., Abdal, R., Qin, Y., Femiani, J., Wonka, P.: Improved StyleGAN embedding: where are the good latents? arXiv preprint arXiv:2012.09036 (2020)

Author information

Correspondence to Min Jin Chong.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 3656 KB)

A Appendix

A.1 Choice of GAN Inversion

Fig. 14. The choice of GAN inversion matters. We compare JoJoGAN trained on e4e [34], II2S [42], and ReStyle [1] inversions. II2S gives the most realistic inversions, leading to stylizations that preserve the shapes and proportions of the reference. ReStyle gives the most accurate reconstructions, leading to stylizations that better preserve the features and proportions of the input.

JoJoGAN relies on GAN inversion to create a paired dataset. We investigate the effect of three different GAN inversion methods, e4e [34], II2S [42], and ReStyle [1], in Fig. 14.

Using e4e fails to accurately recreate the style reference, conveniently giving us a corresponding real face. ReStyle, on the other hand, inverts the reference more accurately, giving a non-realistic face. II2S is a gradient-descent-based method whose regularization term maps the style code to a higher-density region of the latent space. The regularization results in very realistic faces that are somewhat inaccurate to the reference.
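For concreteness, the optimization-based flavor of inversion (in the spirit of II2S) can be sketched as follows. The generator interface (`mean_latent`, `synthesis`) and the regularizer weight are illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn.functional as F

def invert(G, target, steps=500, lr=0.01, reg_weight=0.5):
    # Optimize a W+ code so the generator reproduces `target`, while a
    # regularizer pulls the code toward the mean latent (a high-density region).
    w_mean = G.mean_latent()                    # (1, num_layers, 512); assumed helper
    w = w_mean.clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        recon = G.synthesis(w)                  # image rendered from the current code
        rec_loss = F.mse_loss(recon, target)    # pixel reconstruction term
        reg_loss = ((w - w_mean) ** 2).mean()   # keeps the inversion realistic
        loss = rec_loss + reg_weight * reg_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()
```

A larger `reg_weight` trades reconstruction accuracy for realism, which is exactly the II2S-versus-ReStyle trade-off discussed next.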

The different inversions give different JoJoGAN results. Training with ReStyle leads to clean stylizations that accurately preserve the features and proportions of the input face. Training with II2S, on the other hand, leads to heavy stylization that borrows shapes and proportions from the reference. However, it also introduces substantial semantic changes from the input face and artifacts (note the change of identity and the artifacts along the neck).

Fig. 15. The choice of M matters. M controls the blend between the inverted style code and the mean style code. M1 is closest to the reference, leading to smaller features (e.g., eyes). M3 is closest to a real face, leading to exaggerated features more like the reference, but also to significant artifacts.

In practice, we blend the style codes from ReStyle and the mean face. For M, we borrow the style code from the mean face at layers 7, 9, and 11, which transfers the facial features of the mean face to the inversion. However, blending coarsely at the layer level cannot affect only the proportions of the features: naively blending in the mean face can change the expression of the inversion (e.g., from neutral to smiling) or introduce artifacts. We therefore blend at a finer scale, which we do by isolating specific facial features in the style space using RIS [2]. Figure 15 compares the results of using different M for blending. When the blended image is more face-like (M3), the exaggerated features of the reference are transferred, but significant artifacts are introduced (see M3, row 2). By carefully selecting M, we can transfer the exaggerated features while avoiding artifacts (see M2).
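A minimal sketch of the layer-level blend is given below. The per-layer interpolation is the idea; the tensor layout and the `alpha` parameter are illustrative assumptions, and the finer, feature-level masks obtained with RIS [2] are not shown:

```python
import torch

def blend_codes(w_inv, w_mean, layers=(7, 9, 11), alpha=1.0):
    # Blend the inverted style code with the mean-face code at selected layers.
    # w_inv, w_mean: (1, num_layers, 512) W+ codes. alpha=1.0 copies the mean
    # face entirely at those layers; smaller alpha gives a partial blend.
    w = w_inv.clone()
    for i in layers:
        w[:, i] = (1 - alpha) * w_inv[:, i] + alpha * w_mean[:, i]
    return w
```

Blending individual style channels, rather than whole layers, is what lets M change feature proportions without also changing expression.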

A.2 Identity Loss

Before computing the identity loss, we grayscale the input images to prevent the loss from affecting colors. The weight of the identity loss is reference-dependent, but we typically choose a value between \(2 \times 10^3\) and \(5 \times 10^3\).
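A sketch of this grayscaled identity loss is shown below. The face encoder is assumed to return L2-normalized identity embeddings (e.g., an ArcFace-style network [6]); the cosine-distance form is illustrative:

```python
import torch
import torch.nn.functional as F

def to_gray(img):
    # Luminance conversion; repeat to 3 channels so a pretrained encoder still applies.
    gray = 0.299 * img[:, 0:1] + 0.587 * img[:, 1:2] + 0.114 * img[:, 2:3]
    return gray.repeat(1, 3, 1, 1)

def identity_loss(face_encoder, output, target, weight=2e3):
    # Grayscale both images first so the identity term cannot fight the color style.
    e_out = F.normalize(face_encoder(to_gray(output)), dim=-1)
    e_tgt = F.normalize(face_encoder(to_gray(target)), dim=-1)
    return weight * (1 - (e_out * e_tgt).sum(dim=-1)).mean()
```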

A.3 Choice of Style Mixing Space

Fig. 16. We study how the choice of latent space for style mixing affects JoJoGAN. Style mixing in \(\mathcal {S}\) space gives more accurate color reproduction in (a) and (b) and a better stylization effect (note the eyes) in (c).

Style mixing in Eq. (1) allows us to generate more paired datapoints: it is reasonable to map faces with slight differences in textures and colors to the same reference. However, while we style mix to generate different faces, certain features such as identity and face pose must remain the same. We study how the choice of latent space for style mixing affects the stylization. In Fig. 16 we see that style mixing in \(\mathcal {S}\) gives better color reproduction and an overall better stylization effect. This is because \(\mathcal {S}\) is more disentangled [36] and allows us to style mix more aggressively without changing the features we want to keep intact.
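The style-mixing step can be sketched as follows. Both `G.get_styles(z)` (standing for a helper that returns per-layer style vectors in \(\mathcal {S}\)) and the split between kept and mixed layers are assumptions made for illustration:

```python
import torch

def build_paired_styles(G, s_ref, num_samples=16, coarse_layers=7):
    # Grow the paired dataset: keep the reference inversion's coarse styles
    # (identity, pose) and swap in random fine styles (texture, color), so many
    # different inputs are all paired with the single style reference.
    mixed = []
    for _ in range(num_samples):
        z = torch.randn(1, 512)
        s_rand = G.get_styles(z)                 # assumed helper: per-layer S-space styles
        s_new = [v.clone() for v in s_ref]
        for i in range(coarse_layers, len(s_ref)):
            s_new[i] = s_rand[i]                 # randomize only the fine layers
        mixed.append(s_new)
    return mixed
```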

Fig. 17. The choice of training data has an effect. First row: when there is just one example in \(\mathcal {W}\), JoJoGAN transfers relatively little style, likely because it is trained to map “few” images to the stylized example. Second row: same training procedure as in Fig. 8, using \(\mathcal {C}\). Third row: same training procedure as the second row, but with grayscale images for Eq. (2). Fourth row: same training procedure as in Fig. 8, using \(\mathcal {X}\). Fifth row: same training procedure as the fourth row, but with grayscale images for Eq. (2).

A.4 Varying Dataset

Using \(\mathcal {C}\) and \(\mathcal {X}\) gives different stylization effects. Fine-tuning with \(\mathcal {X}\) accurately reproduces the color profile of the reference, while \(\mathcal {C}\) tries to preserve the input color profile. However, this alone is insufficient to fully preserve the colors, as we see in Fig. 17. Grayscaling the images before computing the loss in Eq. (2), in addition to fine-tuning with \(\mathcal {C}\), gives the stylization effects without altering the color profile. Both \(\mathcal {C}\) and grayscaling are necessary to achieve this effect; using \(\mathcal {X}\) with grayscaling is insufficient.
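As an illustration of the grayscaling step, a wrapper along the following lines could be used; the LPIPS call here merely stands in for whatever form Eq. (2) takes, which is an assumption of this sketch:

```python
import torch
import lpips  # pip install lpips; used only as a stand-in perceptual loss

perceptual = lpips.LPIPS(net='vgg')

def to_gray(img):
    gray = 0.299 * img[:, 0:1] + 0.587 * img[:, 1:2] + 0.114 * img[:, 2:3]
    return gray.repeat(1, 3, 1, 1)               # keep 3 channels for the pretrained net

def grayscale_style_loss(output, reference):
    # Compare stylized output and reference in grayscale so fine-tuning is not
    # pushed to copy the reference's color profile onto the input.
    return perceptual(to_gray(output), to_gray(reference)).mean()
```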

A.5 Feature Matching Loss

For the discriminator feature-matching loss, we use the intermediate activations after resblocks 2, 4, 5, and 6 (Figs. 18, 20 and 21).
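A sketch of the feature-matching term is given below; `D.forward_features`, returning activations keyed by resblock index, is an assumed interface rather than the stock StyleGAN2 discriminator API:

```python
import torch
import torch.nn.functional as F

def feature_matching_loss(D, generated, reference, taps=(2, 4, 5, 6)):
    # Match intermediate discriminator activations after the listed resblocks,
    # rather than (or in addition to) the final real/fake score.
    feats_g = D.forward_features(generated)      # dict: resblock index -> activation
    feats_r = D.forward_features(reference)
    return sum(F.l1_loss(feats_g[i], feats_r[i]) for i in taps)
```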

Fig. 18. More multi-shot examples.

Fig. 19. JoJoGAN produces unsatisfactory style transfers on out-of-distribution cases, producing human-animal hybrids.

Fig. 20. We compare with Zhu et al. [41] on all reference examples used in their paper and described there as hard cases. For each reference, the top row is JoJoGAN and the second row is Zhu et al. Note how their method distorts chin shape, while JoJoGAN produces strong outputs.

Fig. 21. JoJoGAN is a method for benefiting from what a StyleGAN knows, and so should apply to other domains where a well-trained StyleGAN is available. Here we demonstrate JoJoGAN applied to LSUN-Churches.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Chong, M.J., Forsyth, D. (2022). JoJoGAN: One Shot Face Stylization. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13676. Springer, Cham. https://doi.org/10.1007/978-3-031-19787-1_8


  • DOI: https://doi.org/10.1007/978-3-031-19787-1_8


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19786-4

  • Online ISBN: 978-3-031-19787-1

  • eBook Packages: Computer Science (R0)
