Abstract
GANs have been used for a variety of unconditional and conditional generation tasks. While class-conditional generation can be integrated directly into the training process, integrating more sophisticated conditioning signals is not as straightforward. In this work, we consider the task of sampling from \(P(X)\) such that the silhouette of (the subject of) \(X\) matches the silhouette of (the subject of) a given image; that is, we not only specify what to generate, but also control where to put it. More generally, we allow a mask (itself another image) to control the silhouette of the object to be generated; the mask is the result of a segmentation system applied to a user-provided image. To achieve this, we use a pre-trained BigGAN and state-of-the-art segmentation models (e.g., DeepLabv3 and FCN) as follows: we first sample a random latent vector \(z\) from the Gaussian prior of BigGAN and then iteratively modify it until the silhouettes of \(X = G(z)\) and the reference image match. While BigGAN is a class-conditional generative model trained on the 1000 classes of ImageNet, the segmentation models are trained on the 20 classes of the PASCAL VOC dataset; we choose the “Dog” and “Cat” classes to demonstrate our controlled generation model.
Appendices
A Alternative Ensemble Methods
The segmentation models used share a similar architecture (ResNet-101 backbone) and training dataset. Although the range of the logits varies from network to network, we found no evidence that averaging the logits produced by different segmentation modules would not produce good results. We therefore tried averaging the logits and applying a softmax over the channel dimension before computing the BCE loss; the results of this averaged segmentation are shown in Fig. 9. We also tried a variant in which we average the losses instead, as illustrated in Fig. 10. This variant did not work as well as the method illustrated in Fig. 7; the BCE loss used in these implementations may account for the discrepancy.
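A minimal PyTorch sketch of these two ensembling variants is given below; the helper names and the detaching of the target probabilities are our illustrative assumptions, and the inputs are the raw 21-channel logit maps produced by the segmentation heads.

```python
import torch
import torch.nn.functional as F

def logit_average_bce(gen_logits, tgt_logits):
    """Variant of Fig. 9: average raw logits across the ensemble,
    softmax over the 21 PASCAL VOC channels, then BCE between the
    generated and target probability maps."""
    gen = F.softmax(torch.stack(gen_logits).mean(dim=0), dim=1)
    tgt = F.softmax(torch.stack(tgt_logits).mean(dim=0), dim=1)
    return F.binary_cross_entropy(gen, tgt.detach())

def loss_average_bce(gen_logits, tgt_logits):
    """Variant of Fig. 10: compute the BCE loss per model,
    then average the losses across the ensemble."""
    losses = [F.binary_cross_entropy(F.softmax(g, dim=1),
                                     F.softmax(t, dim=1).detach())
              for g, t in zip(gen_logits, tgt_logits)]
    return torch.stack(losses).mean()
```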
B Implementation Details
We use a PyTorch port of the original BigGAN model, as illustrated in Fig. 7. The target and the generated image are used as inputs to two separate segmentation models; both pre-trained segmentation models were taken from the PyTorch Hub.
We use the Adam optimizer with a learning rate of \(10^{-1}\) and beta values \((\beta_1, \beta_2) = (0.5, 0.99)\). The model is run for a maximum of 25 epochs. The segmentation models expect each RGB channel to be normalised with mean \(\mu = [0.485, 0.456, 0.406]\) and standard deviation \(\sigma = [0.229, 0.224, 0.225]\); this normalisation is applied explicitly to every generated image. Mean squared error is used to compute the loss over the “Dog” channel of the two segmentation maps. The two segmentation losses are combined with a weighted average at a ratio of 0.6:0.4, since the DeepLabv3 model performs better than the FCN-ResNet101 model.
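A condensed sketch of this pipeline is given below, assuming the pytorch-pretrained-biggan package as the ported BigGAN and the torchvision hub weights for the two segmentation models; the target tensor and the ImageNet class index are placeholders.

```python
import torch
import torch.nn.functional as F
from pytorch_pretrained_biggan import (BigGAN, one_hot_from_int,
                                       truncated_noise_sample)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Pre-trained generator and segmentation models
gan = BigGAN.from_pretrained("biggan-deep-256").to(device).eval()
deeplab = torch.hub.load("pytorch/vision", "deeplabv3_resnet101",
                         pretrained=True).to(device).eval()
fcn = torch.hub.load("pytorch/vision", "fcn_resnet101",
                     pretrained=True).to(device).eval()

# ImageNet statistics expected by the segmentation backbones
MEAN = torch.tensor([0.485, 0.456, 0.406], device=device).view(1, 3, 1, 1)
STD = torch.tensor([0.229, 0.224, 0.225], device=device).view(1, 3, 1, 1)
DOG = 12  # "dog" channel in the 21-class PASCAL VOC output

def dog_logits(model, img):
    # img in [0, 1]; normalise explicitly, keep only the dog channel
    return model((img - MEAN) / STD)["out"][:, DOG]

# Class vector (e.g. 207 = golden retriever) and trainable latent z
class_vec = torch.from_numpy(one_hot_from_int(207, batch_size=1)).to(device)
z = torch.from_numpy(
    truncated_noise_sample(truncation=0.4, batch_size=1)
).to(device).requires_grad_(True)
opt = torch.optim.Adam([z], lr=1e-1, betas=(0.5, 0.99))

# Placeholder reference image; in practice, the user-provided image
target = torch.rand(1, 3, 256, 256, device=device)
with torch.no_grad():
    t_dl, t_fcn = dog_logits(deeplab, target), dog_logits(fcn, target)

for step in range(25):  # at most 25 optimisation epochs
    opt.zero_grad()
    img = (gan(z, class_vec, 0.4) + 1) / 2  # BigGAN outputs lie in [-1, 1]
    loss = (0.6 * F.mse_loss(dog_logits(deeplab, img), t_dl)
            + 0.4 * F.mse_loss(dog_logits(fcn, img), t_fcn))
    loss.backward()
    opt.step()
```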
C Loss and Channel Experiments
During training, we tried several losses, including cross-entropy, binary cross-entropy, soft cross-entropy, and mean squared error. The output of each segmentation model contains 21 channels, where each channel holds un-normalised probability values (logits) for pixels belonging to a particular class; channel 0 is the background class. For the transformation experiments shown in Fig. 5, we used binary cross-entropy loss computed on the background channel and the “Dog” channel. Figure 11 illustrates how the model then tries to fit the background while reducing the loss. We found that excluding the background channel and computing the mean squared error on the “Dog” channel alone works best.
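The two loss configurations compared here can be sketched as follows; applying a sigmoid to the raw logits before the BCE is our assumption about the exact formulation.

```python
import torch
import torch.nn.functional as F

BACKGROUND, DOG = 0, 12  # PASCAL VOC channel indices

def bce_background_dog(gen_out, tgt_out):
    """Fig. 5 setting: BCE over the background and dog channels;
    tends to make the model fit the background too (Fig. 11)."""
    idx = [BACKGROUND, DOG]
    gen = torch.sigmoid(gen_out[:, idx])
    tgt = torch.sigmoid(tgt_out[:, idx]).detach()
    return F.binary_cross_entropy(gen, tgt)

def mse_dog_only(gen_out, tgt_out):
    """Best-performing setting: MSE on the dog channel alone."""
    return F.mse_loss(gen_out[:, DOG], tgt_out[:, DOG].detach())
```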