Abstract
A style mapper applies some fixed style to its input images (so, for example, taking faces to cartoons). This paper describes a simple procedure – JoJoGAN – to learn a style mapper from a single example of the style. JoJoGAN uses a GAN inversion procedure and StyleGAN’s style-mixing property to produce a substantial paired dataset from a single example style. The paired dataset is then used to fine-tune a StyleGAN. An image can then be style mapped by GAN inversion followed by the fine-tuned StyleGAN. JoJoGAN needs just one reference and as little as 30 s of training time. JoJoGAN can use extreme style references (say, animal faces) successfully. Furthermore, one can control what aspects of the style are used and how much of the style is applied. Qualitative and quantitative evaluations show that JoJoGAN produces high-quality, high-resolution images that vastly outperform the current state-of-the-art.
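To make the pipeline concrete, here is a minimal sketch of the training loop the abstract describes, assuming a pretrained StyleGAN2 generator that maps style codes to images, a GAN-inversion routine invert(), a style-mixing helper mix(), and a perceptual loss lpips(); all four names are stand-in callables, not the authors' released implementation.

```python
import copy
import torch

def jojogan_finetune(generator, invert, mix, lpips, style_ref,
                     n_iters=300, lr=2e-3):
    # Invert the single style reference to obtain its style code.
    w_ref = invert(style_ref)
    # Fine-tune a copy; the original generator is kept for inverting test images.
    finetuned = copy.deepcopy(generator)
    opt = torch.optim.Adam(finetuned.parameters(), lr=lr)
    for _ in range(n_iters):
        # Style-mix w_ref with a random code: a face that should still map to
        # the reference, i.e. one (code, reference) pair of the paired dataset.
        w_mixed = mix(w_ref)
        loss = lpips(finetuned(w_mixed), style_ref)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return finetuned
```

At test time, an input face is GAN-inverted with the original generator and re-synthesized with the fine-tuned copy to apply the style.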
References
Alaluf, Y., Patashnik, O., Cohen-Or, D.: Restyle: a residual-based StyleGAN encoder via iterative refinement. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021
Chong, M.J., Chu, W.S., Kumar, A., Forsyth, D.: Retrieve in style: unsupervised facial feature transfer and retrieval. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3887–3896, October 2021
Chong, M.J., Forsyth, D.: Effectively unbiased FID and Inception Score and where to find them. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6070–6079 (2020)
Chong, M.J., Forsyth, D.: GANs N’ Roses: stable, controllable, diverse image to image translation (works for videos too!) (2021)
Chong, M.J., Lee, H.Y., Forsyth, D.: StyleGAN of all trades: image manipulation with only pretrained StyleGAN. arXiv preprint arXiv:2111.01619 (2021)
Deng, J., Guo, J., Xue, N., Zafeiriou, S.: ArcFace: additive angular margin loss for deep face recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4690–4699 (2019)
Efros, A.A., Freeman, W.T.: Image quilting for texture synthesis and transfer. In: Proceedings of SIGGRAPH 2001, pp. 341–346 (2001)
Gal, R., Patashnik, O., Maron, H., Chechik, G., Cohen-Or, D.: StyleGAN-NADA: CLIP-guided domain adaptation of image generators (2021)
Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016
Hertzmann, A., Jacobs, C.E., Oliver, N., Curless, B., Salesin, D.H.: Image analogies. In: Proceedings of SIGGRAPH 2001, pp. 327–340 (2001)
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Huang, X., Belongie, S.: Arbitrary style transfer in real-time with adaptive instance normalization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1501–1510 (2017)
Huang, X., Liu, M.Y., Belongie, S., Kautz, J.: Multimodal unsupervised image-to-image translation. In: ECCV (2018)
Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: European Conference on Computer Vision (2016)
Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401–4410 (2019)
Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing and improving the image quality of StyleGAN. In: Proceedings of CVPR (2020)
Kim, S.S.Y., Kolkin, N., Salavon, J., Shakhnarovich, G.: Deformable style transfer. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12371, pp. 246–261. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58574-7_15
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. CoRR abs/1412.6980 (2015)
Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., Yang, M.H.: Universal style transfer via feature transforms. In: Advances in Neural Information Processing Systems (2017)
Li, Y., Zhang, R., Lu, J.C., Shechtman, E.: Few-shot image generation with elastic weight consolidation. In: Advances in Neural Information Processing Systems (2020)
Liu, M., Li, Q., Qin, Z., Zhang, G., Wan, P., Zheng, W.: BlendGAN: implicitly GAN blending for arbitrary stylized face generation. In: Advances in Neural Information Processing Systems (2021)
Menon, S., Damian, A., Hu, S., Ravi, N., Rudin, C.: PULSE: self-supervised photo upsampling via latent space exploration of generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2437–2445 (2020)
Mo, S., Cho, M., Shin, J.: Freeze the discriminator: a simple baseline for fine-tuning GANs. In: CVPR AI for Content Creation Workshop (2020)
Ojha, U., et al.: Few-shot image generation via cross-domain correspondence. In: CVPR (2021)
Ojha, U., et al.: Few-shot image generation via cross-domain correspondence. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10743–10752 (2021)
Park, D.Y., Lee, K.H.: Arbitrary style transfer with style-attentional networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5880–5888 (2019)
Pinkney, J.N., Adler, D.: Resolution dependent GAN interpolation for controllable image synthesis between domains. arXiv preprint arXiv:2010.05334 (2020)
Radford, A., et al.: Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020 (2021)
Robb, E., Chu, W.S., Kumar, A., Huang, J.B.: Few-shot adaptation of generative adversarial networks. arXiv preprint arXiv:2010.11943 (2020)
Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. Adv. Neural Inf. Process. Syst. 29, 2234–2242 (2016)
Shen, Y., Yang, C., Tang, X., Zhou, B.: InterFaceGAN: interpreting the disentangled face representation learned by GANs. IEEE Trans. Pattern Anal. Mach. Intell. (2020)
Shen, Y., Zhou, B.: Closed-form factorization of latent semantics in GANs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1532–1540 (2021)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (2015)
Tov, O., Alaluf, Y., Nitzan, Y., Patashnik, O., Cohen-Or, D.: Designing an encoder for StyleGAN image manipulation. arXiv preprint arXiv:2102.02766 (2021)
Wang, X., Li, Y., Zhang, H., Shan, Y.: Towards real-world blind face restoration with generative facial prior. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9168–9178 (2021)
Wu, Z., Lischinski, D., Shechtman, E.: StyleSpace analysis: disentangled controls for StyleGAN image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12863–12872 (2021)
Yang, T., Ren, P., Xie, X., Zhang, L.: GAN prior embedded network for blind face restoration in the wild. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2021)
Yeh, M.C., Tang, S., Bhattad, A., Zou, C., Forsyth, D.: Improving style transfer with calibrated metrics. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), March 2020
Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR (2018)
Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: 2017 IEEE International Conference on Computer Vision (ICCV) (2017)
Zhu, P., Abdal, R., Femiani, J., Wonka, P.: Mind the gap: domain gap control for single shot domain adaptation for generative adversarial networks. In: International Conference on Learning Representations (2022). https://openreview.net/forum?id=vqGi8Kp0wM
Zhu, P., Abdal, R., Qin, Y., Femiani, J., Wonka, P.: Improved StyleGAN embedding: where are the good latents? arXiv preprint arXiv:2012.09036 (2020)
A Appendix
A.1 Choice of GAN Inversion
The choice of GAN inversion matters. We compare JoJoGAN trained on e4e [34], II2S [42], and ReStyle [1] inversions. II2S gives the most realistic inversions, leading to stylizations that preserve the shapes and proportions of the reference. ReStyle gives the most accurate reconstruction, leading to stylizations that better preserve the features and proportions of the input.
JoJoGAN relies on GAN inversion to create a paired dataset. We investigate the effect of using three different GAN inversion methods, e4e [34], II2S [42], and ReStyle [1], in Fig. 14.
Using e4e fails to accurately recreate the style reference but conveniently gives us a corresponding real face. ReStyle, on the other hand, inverts the reference more accurately, giving a non-realistic face. II2S is a gradient-descent-based method with a regularization term that maps the style code to a higher-density region of the latent space. The regularization yields very realistic faces that are somewhat inaccurate to the reference.
The different inversions give different JoJoGAN results. Training with ReStyle leads to clean stylization that accurately preserves the features and proportions of the input face. Training with II2S, on the other hand, leads to heavy stylization that borrows the shapes and proportions from the reference. However, it also introduces substantial semantic changes from the input face, along with artifacts (note the change of identity and the artifacts along the neck).
In practice, we blend the style codes from ReStyle and the mean face. For M, we borrow the style code from the mean face at layers 7, 9, and 11, which transfers the facial features of the mean face to the inversion. However, blending coarsely at the layer level cannot affect only the proportions of the features. For example, naively blending with the mean face can change the expression of the inversion (e.g., from neutral to smiling) or introduce artifacts. We therefore blend at a finer scale, which we do by isolating specific facial features in the style space using RIS [2]. Figure 15 compares the results of using different M for blending. Note that when the blended image is more face-like (M3), the exaggerated features of the reference are transferred; however, significant artifacts are introduced (see M3, row 2). By carefully selecting M, we can transfer the exaggerated features while avoiding artifacts (see M2).
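As a concrete illustration of the blending steps above, the sketch below shows both the coarse layer-level blend and a masked fine blend, assuming W+ codes of shape [1, n_layers, 512] and a per-channel style-space mask (e.g. one obtained with RIS); the function names and tensor shapes are illustrative, not the authors' code.

```python
import torch

def blend_layers(w_inv: torch.Tensor, w_mean: torch.Tensor,
                 layers=(7, 9, 11)) -> torch.Tensor:
    """Coarse blend: copy the mean-face code into the inversion at whole layers."""
    w = w_inv.clone()
    idx = list(layers)
    w[:, idx] = w_mean[:, idx]
    return w

def blend_masked(s_inv: torch.Tensor, s_mean: torch.Tensor,
                 mask: torch.Tensor) -> torch.Tensor:
    """Fine blend: `mask` selects individual style-space channels (e.g. via RIS)."""
    return mask * s_mean + (1.0 - mask) * s_inv
```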
A.2 Identity Loss
Before computing the identity loss, we convert the input images to grayscale so that the identity loss does not affect the colors. The weight of the identity loss is reference-dependent, but we typically choose between \(2 \times 10^3\) and \(5 \times 10^3\).
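A minimal sketch of this step, assuming id_encoder() is a pretrained face-recognition embedder (e.g. ArcFace) that returns identity embeddings for RGB tensors of shape [N, 3, H, W]; the default weight below sits inside the \(2 \times 10^3\) to \(5 \times 10^3\) range quoted above.

```python
import torch
import torch.nn.functional as F

def grayscale(img: torch.Tensor) -> torch.Tensor:
    # ITU-R BT.601 luma weights, broadcast back to three channels so the
    # identity encoder still receives a 3-channel input.
    w = torch.tensor([0.299, 0.587, 0.114], device=img.device).view(1, 3, 1, 1)
    return (img * w).sum(dim=1, keepdim=True).repeat(1, 3, 1, 1)

def identity_loss(id_encoder, output, target, weight=3e3):
    emb_out = F.normalize(id_encoder(grayscale(output)), dim=-1)
    emb_tgt = F.normalize(id_encoder(grayscale(target)), dim=-1)
    # 1 - cosine similarity between identity embeddings, scaled by `weight`.
    return weight * (1.0 - (emb_out * emb_tgt).sum(dim=-1)).mean()
```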
A.3 Choice of Style Mixing Space
Style mixing in Eq. (1) allows us to generate more paired datapoints. It is reasonable to map faces with slight differences in textures and colors to the same reference; as such, while we style mix to generate different faces, certain features such as identity and face pose need to remain the same. We study how the choice of latent space for style mixing affects the stylization. In Fig. 16 we see that style mixing in \(\mathcal {S}\) gives better color reproduction and a better overall stylization effect. This is because \(\mathcal {S}\) is more disentangled [36] and allows us to style mix more aggressively without changing the features we want to keep intact.
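The sketch below shows one way to realize such per-layer mixing, assuming the styles are given as a list of per-layer vectors in \(\mathcal {S}\) and that mix_layers marks the layers whose styles are resampled; it mirrors the mixing idea in Eq. (1) rather than reproducing the authors' exact code.

```python
def style_mix(s_ref, s_rand, mix_layers, alpha=1.0):
    """Resample the reference's styles at mix_layers; keep the rest fixed.

    Layers outside mix_layers retain the reference styles, so identity and
    pose stay fixed while texture and color vary across the paired dataset.
    """
    mixed = []
    for i, (s_r, s_n) in enumerate(zip(s_ref, s_rand)):
        mixed.append((1.0 - alpha) * s_r + alpha * s_n if i in mix_layers else s_r)
    return mixed
```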
The choice of training data has an effect. First row: when there is just one example in \(\mathcal {W}\), JoJoGAN transfers relatively little style, likely because it is trained to map “few” images to the stylized example. Second row: same training procedure as in Fig. 8, using \(\mathcal {C}\). Third row: same training procedure as the second row but with grayscale images for Eq. (2). Fourth row: same training procedure as in Fig. 8, using \(\mathcal {X}\). Fifth row: same training procedure as the fourth row but with grayscale images for Eq. (2).
A.4 Varying Dataset
Using \(\mathcal {C}\) and \(\mathcal {X}\) gives different stylization effects. Fine-tuning with \(\mathcal {X}\) accurately reproduces the color profile of the reference, while \(\mathcal {C}\) tries to preserve the input color profile. However, this alone is insufficient to fully preserve the colors, as we see in Fig. 17. Grayscaling the images before computing the loss in Eq. (2), in addition to fine-tuning with \(\mathcal {C}\), gives stylization effects without altering the color profile. We show that both \(\mathcal {C}\) and grayscaling are necessary to achieve this effect; using \(\mathcal {X}\) with grayscaling is insufficient.
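As a small illustration, the sketch below toggles grayscaling before the Eq. (2) perceptual loss, reusing the grayscale() helper from the A.2 sketch; lpips() again stands in for the perceptual term and the preserve_color flag is purely illustrative.

```python
def perceptual_loss(lpips, output, reference, preserve_color=True):
    """Eq. (2)-style perceptual loss with an optional color-preservation toggle."""
    if preserve_color:
        # Compare luminance only, so the fine-tuned generator is free to keep
        # the input's color profile.
        output, reference = grayscale(output), grayscale(reference)
    return lpips(output, reference)
```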
A.5 Feature Matching Loss
For the discriminator feature matching loss, we compute the intermediate activations after residual blocks 2, 4, 5, and 6 (Figs. 18, 20 and 21).
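A generic version of this loss is sketched below, under the assumption that the discriminator's residual blocks are available as an ordered list of modules; the 1-indexed block choices follow the text, while everything else (L1 distance, module interface) is illustrative.

```python
import torch.nn.functional as F

def feature_matching_loss(disc_blocks, fake, real, block_ids=(2, 4, 5, 6)):
    """L1 distance between fake/real activations after the selected blocks."""
    loss = 0.0
    feat_fake, feat_real = fake, real
    for i, block in enumerate(disc_blocks, start=1):
        feat_fake = block(feat_fake)
        feat_real = block(feat_real)
        if i in block_ids:
            # Detach the real-image features: the discriminator only acts as a
            # fixed feature extractor here.
            loss = loss + F.l1_loss(feat_fake, feat_real.detach())
    return loss
```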
We compare with Zhu et al. [41] on all references used in their paper and described there as hard cases. For each reference, the top row is JoJoGAN and the second row is Zhu et al. Note how their method distorts the chin shape, while JoJoGAN produces strong outputs.