Abstract
GANs have been used for a variety of unconditional and conditional generation tasks. While class-conditional generation can be integrated directly into the training process, integrating more sophisticated conditioning signals is not as straightforward. In this work, we consider the task of sampling from \(P(X)\) such that the silhouette of (the subject of) \(X\) matches the silhouette of (the subject of) a given image; that is, we not only specify what to generate, but also control where to put it. More generally, we allow a mask (itself another image) to control the silhouette of the object to be generated; the mask is the result of a segmentation system applied to a user-provided image. To achieve this, we use a pre-trained BigGAN and state-of-the-art segmentation models (e.g., DeepLabv3 and FCN) as follows: we first sample a random latent vector \(z\) from the Gaussian prior of BigGAN and then iteratively modify it until the silhouettes of \(X = G(z)\) and the reference image match. While BigGAN is a class-conditional generative model trained on the 1000 classes of ImageNet, the segmentation models are trained on the 20 classes of the PASCAL VOC dataset; we choose the “Dog” and “Cat” classes to demonstrate our controlled generation model.
Appendices
A Alternative Ensemble Methods
The segmentation models used share a similar architecture (ResNet-101 backbone) and training dataset. Although the range of the logits varies from network to network, we found no evidence that averaging the logits produced by different segmentation modules would not produce good results. We therefore tried averaging the logits and applying a softmax over the channel dimension before computing the BCE loss; the results of this averaged segmentation are shown in Fig. 9. We also tried a variant in which we average the losses instead, as illustrated in Fig. 10. This variant did not work as well as the method illustrated in Fig. 7; the BCE loss used in these implementations may account for the discrepancy.
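A minimal PyTorch sketch of these two ensembling variants is given below; the helper names and the detaching of the target probabilities are our illustrative assumptions, and the inputs are the raw 21-channel logit maps produced by the segmentation heads.

```python
import torch
import torch.nn.functional as F

def logit_average_bce(gen_logits, tgt_logits):
    """Variant of Fig. 9: average raw logits across the ensemble,
    softmax over the 21 PASCAL VOC channels, then BCE between the
    generated and target probability maps."""
    gen = F.softmax(torch.stack(gen_logits).mean(dim=0), dim=1)
    tgt = F.softmax(torch.stack(tgt_logits).mean(dim=0), dim=1)
    return F.binary_cross_entropy(gen, tgt.detach())

def loss_average_bce(gen_logits, tgt_logits):
    """Variant of Fig. 10: compute the BCE loss per model,
    then average the losses across the ensemble."""
    losses = [F.binary_cross_entropy(F.softmax(g, dim=1),
                                     F.softmax(t, dim=1).detach())
              for g, t in zip(gen_logits, tgt_logits)]
    return torch.stack(losses).mean()
```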
B Implementation Details
We use a PyTorch port of the original BigGAN model, as illustrated in Fig. 7. The target and the generated image are used as inputs to two separate segmentation models; both pre-trained segmentation models were taken from the PyTorch Hub.
We use the Adam optimizer with a learning rate of \(10^{-1}\) and beta values \((\beta_1, \beta_2) = (0.5, 0.99)\). The model is run for a maximum of 25 epochs. The segmentation models expect each RGB channel to be normalised with mean \(\mu = [0.485, 0.456, 0.406]\) and standard deviation \(\sigma = [0.229, 0.224, 0.225]\); this normalisation is applied explicitly to every generated image. Mean squared error is used to compute the loss over the “Dog” channel of the two segmentation maps. The two segmentation losses are combined with a weighted average at a ratio of 0.6:0.4, since the DeepLabv3 model performs better than the FCN-ResNet101 model.
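A condensed sketch of this pipeline is given below, assuming the pytorch-pretrained-biggan package as the ported BigGAN and the torchvision hub weights for the two segmentation models; the target tensor and the ImageNet class index are placeholders.

```python
import torch
import torch.nn.functional as F
from pytorch_pretrained_biggan import (BigGAN, one_hot_from_int,
                                       truncated_noise_sample)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Pre-trained generator and segmentation models
gan = BigGAN.from_pretrained("biggan-deep-256").to(device).eval()
deeplab = torch.hub.load("pytorch/vision", "deeplabv3_resnet101",
                         pretrained=True).to(device).eval()
fcn = torch.hub.load("pytorch/vision", "fcn_resnet101",
                     pretrained=True).to(device).eval()

# ImageNet statistics expected by the segmentation backbones
MEAN = torch.tensor([0.485, 0.456, 0.406], device=device).view(1, 3, 1, 1)
STD = torch.tensor([0.229, 0.224, 0.225], device=device).view(1, 3, 1, 1)
DOG = 12  # "dog" channel in the 21-class PASCAL VOC output

def dog_logits(model, img):
    # img in [0, 1]; normalise explicitly, keep only the dog channel
    return model((img - MEAN) / STD)["out"][:, DOG]

# Class vector (e.g. 207 = golden retriever) and trainable latent z
class_vec = torch.from_numpy(one_hot_from_int(207, batch_size=1)).to(device)
z = torch.from_numpy(
    truncated_noise_sample(truncation=0.4, batch_size=1)
).to(device).requires_grad_(True)
opt = torch.optim.Adam([z], lr=1e-1, betas=(0.5, 0.99))

# Placeholder reference image; in practice, the user-provided image
target = torch.rand(1, 3, 256, 256, device=device)
with torch.no_grad():
    t_dl, t_fcn = dog_logits(deeplab, target), dog_logits(fcn, target)

for step in range(25):  # at most 25 optimisation epochs
    opt.zero_grad()
    img = (gan(z, class_vec, 0.4) + 1) / 2  # BigGAN outputs lie in [-1, 1]
    loss = (0.6 * F.mse_loss(dog_logits(deeplab, img), t_dl)
            + 0.4 * F.mse_loss(dog_logits(fcn, img), t_fcn))
    loss.backward()
    opt.step()
```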
C Loss and Channel Experiments
During training, we tried several losses, including cross-entropy, binary cross-entropy, soft cross-entropy, and mean squared error. The output of each segmentation model contains 21 channels, where each channel holds un-normalised probability values (logits) for pixels belonging to a particular class; channel 0 is the background class. For the transformation experiments shown in Fig. 5, we used binary cross-entropy loss computed on the background channel and the “Dog” channel. Figure 11 illustrates how the model then tries to fit the background while reducing the loss. We found that excluding the background channel and computing the mean squared error on the “Dog” channel alone works best.
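The two loss configurations compared here can be sketched as follows; applying a sigmoid to the raw logits before the BCE is our assumption about the exact formulation.

```python
import torch
import torch.nn.functional as F

BACKGROUND, DOG = 0, 12  # PASCAL VOC channel indices

def bce_background_dog(gen_out, tgt_out):
    """Fig. 5 setting: BCE over the background and dog channels;
    tends to make the model fit the background too (Fig. 11)."""
    idx = [BACKGROUND, DOG]
    gen = torch.sigmoid(gen_out[:, idx])
    tgt = torch.sigmoid(tgt_out[:, idx]).detach()
    return F.binary_cross_entropy(gen, tgt)

def mse_dog_only(gen_out, tgt_out):
    """Best-performing setting: MSE on the dog channel alone."""
    return F.mse_loss(gen_out[:, DOG], tgt_out[:, DOG].detach())
```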