Abstract
State-of-the-art Computer Vision pipelines perform poorly on artworks and other data from the artistic domain, which limits the applicability of current architectures to the automatic understanding of cultural heritage. This is mainly due to the difference in texture and low-level feature distribution between artistic images and the real images on which state-of-the-art approaches are usually trained. To enhance the applicability of pre-trained architectures to artistic data, we have recently proposed an unpaired domain translation approach which can translate artworks to photo-realistic visualizations. Our approach leverages semantically-aware memory banks of real patches, which are used to drive the generation of the translated image while improving its realism. In this paper, we provide additional analyses and experimental results which demonstrate the effectiveness of our approach. In particular, we use automatic distance metrics to evaluate the quality of generated results in the translation of landscapes, portraits and paintings from four different styles. We also analyze the response of pre-trained architectures for classification, detection and segmentation, both in terms of feature distribution and entropy of prediction, and show that our approach effectively reduces the domain shift of paintings. As an additional contribution, we provide a qualitative analysis of the reduction of the domain shift for detection, segmentation and image captioning.
1 Introduction
Although our society has inherited a huge patrimony of artworks, Computer Vision techniques are usually conceived for realistic images and are rarely applied to visual data coming from the artistic domain, regardless of the potential benefits of having architectures capable of understanding our cultural heritage.
As most recent computer vision achievements have relied on learning low-level and high-level features from images depicting the real world, it is also not straightforward to apply pre-trained architectures to the domain of artworks and paintings, whose texture and low-level features are different from those of the real world [2, 22, 24]. An example of the resulting effect is shown in Fig. 1, where we plot the activations of two pre-trained image classification architectures (respectively, VGG-19 [23] and ResNet-101 [8]) on real images and paintings belonging to the categories of landscapes and portraits. As can be seen, even though real and artistic images belong to the same semantic classes, their predicted feature distributions remain separate, clearly highlighting how difficult it is for the two pre-trained networks to deal with artistic data.
On the other hand, it is not feasible to re-train state-of-the-art architectures on artistic data, as no large annotated datasets exist in the cultural heritage domain. To address this domain-shift problem while still exploiting the knowledge learned by pre-trained architectures, we have recently proposed a pixel-level domain translation architecture [25] that maps paintings to photo-realistic visualizations, generating translated images which look realistic while preserving the semantic content of the painting. The problem is one of unpaired domain translation, as no annotated pairing exists: photo-realistic visualizations of paintings are rarely available, and when they are, they usually come in limited number. Therefore, the translation is learned by recovering a latent alignment between two unpaired sets: that of paintings and that of real images. The proposed solution is based on a generative cycle-consistent architecture, endowed with multi-scale memory banks which are in charge of memorizing and recovering the details of realistic images in a semantically consistent way. As a result, generated images look more realistic from a qualitative point of view. They are also closer to real images in the feature space of pre-trained architectures, leading to reduced prediction errors without the need to re-train state-of-the-art approaches.
In this paper, after a brief description of our architecture, we provide additional analyses and experimental results to showcase the effectiveness of our approach. Firstly, we evaluate the quality of the generated images in the case of the translation of landscapes, portraits and four different artistic styles, in comparison with other state-of-the-art unpaired translation approaches. Further, we investigate the response of pre-trained architectures for classification, detection and semantic segmentation. As results will show, our approach reduces the entropy of prediction and produces images which are close in feature space to real images. Finally, we conduct a qualitative analysis of the reduction in domain shift, by testing with pre-trained detection, segmentation and captioning networks.
2 Semantically-Aware Image-to-Image Translation
In order to make state-of-the-art computer vision techniques suitable for understanding artistic data, we have not proposed a new specific architecture for this kind of data, but adopted instead a more general solution which fits available data to existing methods. The data adaptation approach we follow consists in the transformation of a painting to a photo-realistic visualization preserving the content and the overall appearance. This is done through generative models [6] equipped with a cycle-consistent constraint [26] and a semantic knowledge of the scene.
2.1 Cycle-Consistency
Early results of translations between paintings and reality have been shown in Zhu et al. [26], on a limited number of artistic settings. In a nutshell, their architecture consists of two Generative Adversarial Networks [6], one taking real photos as input and trained to generate fake paintings, and the other taking real paintings as input and trained to generate fake photos. When a new (realistic or artistic) image is synthesized by a generator, it is brought back to its original domain by the other generator, and the resulting distance from the original image becomes the cycle-consistency objective to minimize. Formally, let x be a sample from the artistic domain X, y a sample from the realistic domain Y, and G and F two functions mapping images from X to Y and from Y to X respectively; cycle-consistency then imposes that \(F(G(x)) \approx x\) and that \(G(F(y)) \approx y\).
Since our objective is to generate realistic images, rather than style-transferred versions of real images, we focused on the first constraint. We noticed, however, that the adversarial objectives and cycle-consistency loss proposed in [26] often fail, on their own, to preserve semantic consistency and to produce realistic details.
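The two cycle-consistency constraints can be illustrated with a small NumPy sketch. It is not the training code: the generators `G` and `F` below are toy stand-in functions, and the use of a mean absolute (L1) distance is an assumption, since the distance is not specified here.

```python
import numpy as np

def cycle_consistency_loss(x, y, G, F):
    """Sum of the round-trip reconstruction errors in both directions.

    The mean absolute error is an assumed choice of distance.
    """
    forward = np.mean(np.abs(F(G(x)) - x))   # painting -> photo -> painting
    backward = np.mean(np.abs(G(F(y)) - y))  # photo -> painting -> photo
    return float(forward + backward)

# Toy stand-ins for the two generators (hypothetical, for illustration only).
G = lambda img: img * 2.0   # "artistic" -> "realistic"
F = lambda img: img / 2.0   # "realistic" -> "artistic"

rng = np.random.default_rng(0)
x = rng.random((8, 8, 3))   # a fake painting
y = rng.random((8, 8, 3))   # a fake photo
loss = cycle_consistency_loss(x, y, G, F)
```

Because the toy `F` exactly inverts the toy `G`, the loss vanishes here; with real generators it is a training signal that penalizes translations which cannot be mapped back to their source.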
2.2 Semantic-Consistency and Realistic Details
Our first exploration concerned the possibility of constraining our baseline to produce photo-realistic details at multiple scales, and not only an overall plausible image. Our main intuition was that realism, at sufficiently small scales, can be obtained from existing real details, recovered from previously extracted patches coming from the realistic domain. Following this line, in a preliminary work [24] we obtained better results than the Cycle-GAN baseline. Later, we further improved the realism of the generation by considering patches as members of specific semantic classes and trying to preserve this membership during the generation [25].
Memory Banks. Considering details as fixed-size square patches, we model the distribution of realistic details as a set of memory banks, each containing a number of patches obtained from available real photos (i.e. from domain Y). Each memory bank \(\varvec{B}^c\) contains only RGB patches belonging to a specific semantic class c, as predicted by the weakly-supervised model by Hu et al. [10], leading to as many memory banks as the number of different classes found in Y, plus a background class. Patches are extracted in a sliding window manner, with specific sizes and strides.
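As a rough sketch of how such class-specific memory banks could be built, the snippet below extracts patches with a sliding window and groups them by class. The function names are hypothetical, and assigning each patch to the majority class of its pixels is an assumption: the rule used to decide the class of a patch is not stated here.

```python
import numpy as np

def extract_patches(image, size, stride):
    """Slide a size x size window over an H x W x C image.

    Returns the stacked patches and the top-left corner of each one.
    """
    h, w = image.shape[:2]
    patches, corners = [], []
    for top in range(0, h - size + 1, stride):
        for left in range(0, w - size + 1, stride):
            patches.append(image[top:top + size, left:left + size])
            corners.append((top, left))
    return np.stack(patches), corners

def build_memory_banks(images, label_maps, size, stride):
    """Group flattened patches by semantic class.

    label_maps holds one H x W integer map per image (0 = background).
    Majority voting inside the window is an assumed assignment rule.
    """
    banks = {}
    for image, labels in zip(images, label_maps):
        patches, corners = extract_patches(image, size, stride)
        for patch, (top, left) in zip(patches, corners):
            window = labels[top:top + size, left:left + size]
            cls = int(np.bincount(window.ravel()).argmax())
            banks.setdefault(cls, []).append(patch.reshape(-1))
    return {cls: np.stack(rows) for cls, rows in banks.items()}

# Toy example: an 8 x 8 RGB image whose right half belongs to class 1.
image = np.arange(8 * 8 * 3, dtype=float).reshape(8, 8, 3)
labels = np.zeros((8, 8), dtype=int)
labels[:, 4:] = 1
banks = build_memory_banks([image], [labels], size=4, stride=4)
```

Each bank is a matrix with one flattened RGB patch per row, which is a convenient layout for the pairwise similarity computations used later.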
Since we want the semantic content of an image to be the same before and after the generation, we also need to keep the semantic segmentation masks of source images, i.e. images coming from domain X. In the following, a mask of class c, from source image x, will be denoted as \(\varvec{M}^c_x\).
Semantically-Consistent Generation. In order to make the generator G(x) aware of the semantic content of its input artistic image, we exploit masks \(\varvec{M}^c_x\). They let us split the content of the source image x (and therefore of its translation G(x)) according to the semantic classes composing the scene. During training, when a translated image G(x) is generated, each of its regions belonging to a specific class is split into patches as well. We developed a matching strategy to pair generated patches of class c with their most similar real patches belonging to memory bank \(\varvec{B}^c\), and we adopted the contextual loss [20] to maximize this similarity. Since the goal of our work is to enhance the performance of existing architectures on artistic data, relying on semantic masks computed on paintings would create a chicken-and-egg problem. To overcome this limitation, as the training proceeds we regularly replace the masks computed on the painting x, \(\varvec{M}^c_x\), with masks computed on the generated image G(x), \(\varvec{M}^c_{G(x)}\).
Patch-Similarity Driven Generation. Let \(\varvec{K}^c\) be the set of generated patches from regions of G(x) belonging to class c. We compute the cosine similarity between all patches in \(\varvec{K}^c\) and all patches in \(\varvec{B}^c\) and apply a row-wise softmax normalization to the pairwise similarity matrix. The result is an affinity matrix \(\varvec{A}_{ij}^c\), where i indexes \(\varvec{K}^c\) and j indexes \(\varvec{B}^c\). Repeating this operation for each mask found in G(x), we obtain a number of affinity matrices equal to the number of semantic classes in G(x). The contextual loss [20] is in charge of minimizing the distance between pairs of similar patches:
with \(N_K^c\) denoting the cardinality of \(\varvec{K}^c\). The complete contextual objective is the summation of Eq. 1 computed for each class c found in G(x), i.e. with different affinity matrices \(\varvec{A}_{ij}^c\):
All the previously discussed operations are repeated considering patches extracted with different size and stride values, using scale-specific memory banks and leading to scale-specific affinity matrices. The overall multi-scale contextual loss is the sum of scale-specific contextual losses:
Our final loss is the composition of adversarial, cycle-consistent and contextual losses, as follows:
where \(\mathcal {L}_{GAN}\) and \(\mathcal {L}_{CYC}\) are, respectively, the adversarial and cycle-consistency losses mentioned in Sect. 2.1, and \(\lambda \) controls the contextual loss importance.
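The patch-similarity objective described in this section can be sketched in NumPy as follows. The affinity computation (cosine similarity followed by a row-wise softmax) follows the description above. The per-class loss is an assumption: it takes the form used by the contextual loss of [20], i.e. \(-\log \big(\frac{1}{N_K^c}\sum _i \max _j \varvec{A}_{ij}^c\big)\), which is then summed over classes and scales and added to the adversarial and cycle-consistency terms as \(\mathcal {L}_{GAN} + \mathcal {L}_{CYC} + \lambda \) times the multi-scale contextual loss. All function names are hypothetical.

```python
import numpy as np

def affinity_matrix(generated, bank, eps=1e-8):
    """Row-wise softmax of cosine similarities.

    generated: (N_K, D) flattened generated patches of one class.
    bank:      (N_B, D) flattened real patches of the same class.
    """
    g = generated / (np.linalg.norm(generated, axis=1, keepdims=True) + eps)
    b = bank / (np.linalg.norm(bank, axis=1, keepdims=True) + eps)
    cosine = g @ b.T
    # Subtract the row maximum before exponentiating for numerical stability.
    weights = np.exp(cosine - cosine.max(axis=1, keepdims=True))
    return weights / weights.sum(axis=1, keepdims=True)

def contextual_loss(generated, bank):
    """Assumed per-class loss: -log of the mean best affinity per generated patch."""
    affinity = affinity_matrix(generated, bank)
    return float(-np.log(affinity.max(axis=1).mean()))

def multiscale_contextual_loss(patches_by_scale, banks_by_scale):
    """Sum the per-class losses over the classes found in G(x), then over scales."""
    total = 0.0
    for scale, patches_by_class in patches_by_scale.items():
        for cls, generated in patches_by_class.items():
            total += contextual_loss(generated, banks_by_scale[scale][cls])
    return total

def total_loss(l_gan, l_cyc, l_cxms, lam=0.1):
    """Adversarial + cycle-consistency + weighted multi-scale contextual loss."""
    return l_gan + l_cyc + lam * l_cxms

# Toy example: two generated patches matched against a three-patch bank.
bank = np.eye(3)
generated = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 2.0]])
affinity = affinity_matrix(generated, bank)
best_match = affinity.argmax(axis=1)  # index of the closest bank patch per row
```

Under this assumed form the loss is non-negative and decreases as each generated patch concentrates its affinity on a single real patch of the same class.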
3 Experimental Evaluation
3.1 Datasets
Our artistic datasets all come from Wikiart. Besides generic landscape artworks, we also collected four sets of paintings considering different artistic styles (i.e. expressionism, impressionism, realism, and romanticism). To validate our model under a different setting, we used a set of generic portraits as an additional dataset. The model was trained using two sets of real images, one depicting real landscapes and the other containing photographs of real people. The size of each considered set of images is, respectively, landscape paintings: 2044, portraits: 1714, expressionism: 145, impressionism: 852, realism: 310, romanticism: 256, real landscape photographs: 2048, real people photographs: 2048. Due to the limited size of the style-specific sets of paintings, we only used them to validate the generalization capabilities of our model on unseen landscape images.
3.2 Implementation Details
Our generative networks are inspired by Johnson et al. [12], with two stride-2 convolutions, several residual blocks and two stride-1/2 convolutions. Our discriminators are PatchGANs [11, 15, 16]. Memory bank patches were obtained from the two sets of real images (i.e. real landscape photographs and real people photographs). Painting masks were replaced with masks of the generated images every 20 epochs, starting from epoch 40. Three patch scales were adopted for the multi-scale version of the model: \(4\times 4\) with stride 4, \(8\times 8\) with stride 5 and \(16\times 16\) with stride 6. The chosen value for \(\lambda \) in Eq. 4 was 0.1. Weights were initialized from a Gaussian distribution with 0 mean and standard deviation 0.02. We trained our model for 300 epochs using the Adam optimizer [13] with a batch size of 1. A constant learning rate of 0.0002 was used for the first 100 epochs, after which it decayed linearly to zero over the next 200 epochs. To reduce training time, an early stopping technique was adopted: if the Fréchet Inception Distance [9] did not decrease for 30 consecutive epochs, the training was stopped.
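The training schedule described above can be summarized in a few lines of plain Python. This is only a sketch with hypothetical helper names; in particular, counting epochs from zero is an assumption.

```python
def learning_rate(epoch, base_lr=0.0002, constant_epochs=100, decay_epochs=200):
    """Constant rate for the first 100 epochs, then linear decay to zero."""
    if epoch < constant_epochs:
        return base_lr
    remaining = constant_epochs + decay_epochs - epoch
    return base_lr * max(remaining, 0) / decay_epochs

def should_update_masks(epoch, start=40, every=20):
    """True on the epochs where painting masks are replaced by masks of G(x)."""
    return epoch >= start and (epoch - start) % every == 0

class EarlyStopping:
    """Signal a stop when the FID has not decreased for `patience` epochs in a row."""

    def __init__(self, patience=30):
        self.patience = patience
        self.best = float("inf")
        self.stale = 0

    def step(self, fid):
        """Record the FID of the current epoch; return True if training should stop."""
        if fid < self.best:
            self.best = fid
            self.stale = 0
        else:
            self.stale += 1
        return self.stale >= self.patience

# Epochs (0-based) at which the masks would be refreshed during a 300-epoch run.
update_epochs = [e for e in range(300) if should_update_masks(e)]
```

A training loop would call `learning_rate(epoch)` before each epoch, refresh the masks when `should_update_masks(epoch)` holds, and break out as soon as `EarlyStopping.step` returns `True`.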
3.3 Visual Quality Evaluation
The realism of images generated by our method can be evaluated quantitatively through a similarity measure between the Inception representations of fake images and those of samples from the target distribution. We adopt the Kernel Inception Distance (KID) [3], which measures the squared Maximum Mean Discrepancy between Inception representations. Compared to the Fréchet Inception Distance [9], the KID metric is more reliable, especially when it is computed over fewer test images than the dimensionality of the Inception features. Table 1 shows KID values computed between the representations of generated and real images, for different settings. Following the original paper [3], the final KID values were averaged over 100 different splits of size 100, randomly sampled from each setting. As can be seen, our semantic-aware architecture is able to lower the KID in almost all the settings. Our KID values are compared with those from Cycle-GAN [26] and UNIT [18], which we trained on the datasets discussed in Sect. 3.1 using the original authors' implementations. The style-transferred reals row reports the KID values of images obtained through the method of Gatys et al. [4], considering real photos as content images and randomly sampled paintings (from a specific artistic setting) as style images. The style-specific columns of Table 1 report KID values on expressionism, impressionism, realism and romanticism computed using the models trained on generic landscapes.
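For illustration, the split-averaged KID can be sketched in NumPy as below. The sketch operates on feature vectors that are assumed to have been extracted beforehand; the cubic polynomial kernel and the unbiased MMD estimator are the defaults proposed in [3] and are assumed here, as the kernel is not stated in this section. Function names are hypothetical.

```python
import numpy as np

def polynomial_kernel(a, b):
    """Cubic polynomial kernel (a.b / d + 1)^3, with d the feature dimension."""
    d = a.shape[1]
    return (a @ b.T / d + 1.0) ** 3

def mmd2_unbiased(x, y):
    """Unbiased estimate of the squared Maximum Mean Discrepancy."""
    m, n = len(x), len(y)
    k_xx = polynomial_kernel(x, x)
    k_yy = polynomial_kernel(y, y)
    k_xy = polynomial_kernel(x, y)
    # Diagonal terms are excluded from the within-set averages.
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return float(term_xx + term_yy - 2.0 * k_xy.mean())

def kid(fake, real, n_splits=100, split_size=100, seed=0):
    """Average the squared MMD over random splits of the two feature sets.

    Each set must contain at least split_size feature vectors.
    """
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_splits):
        f = fake[rng.choice(len(fake), split_size, replace=False)]
        r = real[rng.choice(len(real), split_size, replace=False)]
        scores.append(mmd2_unbiased(f, r))
    return float(np.mean(scores))
```

In practice `fake` and `real` would hold one Inception feature vector per generated and real image; lower values indicate that the two sets are harder to tell apart in that feature space.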
3.4 Entropy Analysis
Analyzing the output probabilities of a model helps to evaluate its level of uncertainty about its input. Specifically, we can compute the entropy of a model on a given image from its output probabilities. Averaging the entropy values computed on all the images from a given setting, we can determine how uncertain a model is about its scores on this setting: a high entropy value indicates a high level of uncertainty. Table 2 shows average entropy values of different existing models on original paintings, real photos and images generated through our model and its competitors. As can be noticed, our model yields the lowest mean entropy in all the considered tasks, i.e. classification (VGG-19 [23], ResNet-101 [8]), semantic segmentation (Mask\(^X\) R-CNN [10]) and detection (Faster R-CNN [21]). The entropy was computed by averaging image entropy for classification, pixel entropy for segmentation and bounding box entropy for detection, on the landscapes and portraits settings.
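The entropy computation can be illustrated with a short NumPy sketch. The use of the natural logarithm and the small constant added inside the logarithm are assumptions made for this example; the function names are hypothetical.

```python
import numpy as np

def prediction_entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of each probability vector along the last axis."""
    probs = np.asarray(probs, dtype=float)
    return -(probs * np.log(probs + eps)).sum(axis=-1)

def mean_entropy(probs):
    """Average entropy over a set of predictions (images, pixels or boxes)."""
    return float(prediction_entropy(probs).mean())

# Toy example with four classes: a confident and a maximally uncertain prediction.
confident = [[1.0, 0.0, 0.0, 0.0]]
uncertain = [[0.25, 0.25, 0.25, 0.25]]
low, high = mean_entropy(confident), mean_entropy(uncertain)
```

The same function covers the three tasks: rows are per-image class probabilities for classification, per-pixel probabilities for segmentation, and per-box probabilities for detection.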
3.5 Feature Distributions Visualization
As mentioned in Sect. 1, there is a strong domain gap between real images and paintings, especially when considering distributions of features coming from a CNN. To verify the reduction of this domain gap, Fig. 2 shows the distributions of different types of features extracted from images generated by our model, their artistic versions, and real images. We compare feature distributions coming from two classification models (i.e. VGG-19 [23], ResNet-101 [8]) and from an object detection network (i.e. Faster R-CNN [21]). We also include feature distributions representing Gram matrices [5], which encode image styles and textures. To represent each image, we extracted a visual feature vector coming either from the fc7 layer of a VGG-19 or from the average pooling layer of a ResNet-101. In the case of the detection network, we extracted a set of feature vectors from Faster R-CNN trained on Visual Genome [14], representing the detected image regions, which were averaged to obtain a single visual descriptor for each image. To compute Gram matrices, we extracted features from the fc3 layer of a VGG-19. Given these n-dimensional representations of each image (with n equal to 2048 for ResNet-101 and Faster R-CNN, and 4096 for VGG-19 and the Gram matrices), we projected them into a 2-dimensional space by using the t-SNE algorithm [19]. As can be seen, the distributions of our generated images are closer to the distributions of real images than to those of paintings, thus confirming the reduction of the domain shift between real and artistic images in almost all considered settings.
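As an illustration of the Gram-matrix descriptor, the following NumPy sketch turns a convolutional feature map into a single vector. It is not the extraction code used for Fig. 2: the input layout (channels first), the normalization by the number of spatial positions and the function name are assumptions.

```python
import numpy as np

def gram_descriptor(feature_map):
    """Flattened Gram matrix of a C x H x W feature map.

    Entry (a, b) is the inner product between channels a and b,
    divided by the number of spatial positions (assumed normalization).
    """
    c, h, w = feature_map.shape
    flat = feature_map.reshape(c, h * w)
    gram = flat @ flat.T / (h * w)
    return gram.reshape(-1)

# Toy example: a random 64-channel feature map.
rng = np.random.default_rng(0)
features = rng.random((64, 14, 14))
descriptor = gram_descriptor(features)
```

For a 64-channel map the descriptor has 64 x 64 = 4096 entries, which happens to match the Gram-matrix dimensionality quoted above; whether the descriptors of Fig. 2 were obtained from such a map is not stated and is only a guess.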
4 Reducing the Domain Shift: A Qualitative Analysis
The scarcity of annotated artistic datasets does not allow the use of standard quantitative evaluation metrics for computer vision models on our data. We can numerically assess the quality of the generation, but we cannot systematically evaluate whether a pre-trained segmentation model, for example, works better on our generated images than on the original paintings. For this reason we show, through a number of qualitative examples, that a fake-realistic image generated by our architecture is easily understood by state-of-the-art models, unlike its original painted version. Figure 3 shows pairs of paintings and generated images which are both given as input to Mask R-CNN [7] pre-trained on COCO [17]: besides improving the score for well-labeled masks, we are also able to reduce the number of false positives (top-left and bottom-right) and false negatives (bottom-left). Figure 4 illustrates bounding boxes predicted by Faster R-CNN [21] pre-trained on Visual Genome [14]: again we observe improved results, with true clouds detected instead of pillows (top-right) and true sky instead of water (top-left and middle-left). Finally, Fig. 5 presents sentences generated by the captioning approach of [1] on paintings and fake generated photos. As can be observed, textual descriptions become more accurate and better aligned with the depicted scene after using our translation approach. We also observe a reduction in the number of hallucinations (e.g. a boat in the middle-left example, a dog in the bottom-left example). These observations justify and motivate our work, which is an attempt to extend the computer vision field to the still unexplored artistic domain.
5 Conclusion
We have presented an unpaired image-to-image translation approach which can translate paintings to photo-realistic visualizations. Our work is motivated by the poor performance of pre-trained architectures on artistic data, and by the need for Computer Vision pipelines capable of understanding cultural heritage. The presented approach is based on a cycle-consistent translation framework endowed with multi-scale memory banks of patches, so that generated patches are constrained to be similar to real ones. It also includes a semantic-aware strategy that enforces the semantic correctness of generated patches. In this paper, we have conducted additional experiments and evaluations: firstly, we have assessed the visual quality of generated images in the case of landscapes, portraits and paintings from different styles. Further, we have investigated the response of pre-trained architectures in terms of entropy of prediction and feature distribution. Results have confirmed that our approach is able to generate images which look realistic both from a qualitative point of view and in terms of the predictions given by pre-trained architectures. Finally, as an additional contribution, we have presented some qualitative predictions given by detection, segmentation and captioning networks on images generated by our approach.
References
Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
Baraldi, L., Cornia, M., Grana, C., Cucchiara, R.: Aligning text and document illustrations: towards visually explainable digital humanities. In: Proceedings of the International Conference on Pattern Recognition (2018)
Bińkowski, M., Sutherland, D.J., Arbel, M., Gretton, A.: Demystifying MMD GANs. In: Proceedings of the International Conference on Learning Representations (2018)
Gatys, L.A., Ecker, A.S., Bethge, M.: A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576 (2015)
Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems (2014)
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the International Conference on Computer Vision (2017)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Klambauer, G., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a Nash equilibrium. In: Advances in Neural Information Processing Systems (2017)
Hu, R., Dollár, P., He, K., Darrell, T., Girshick, R.: Learning to segment every thing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: Proceedings of the European Conference on Computer Vision (2016)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: Proceedings of the International Conference on Learning Representations (2015)
Krishna, R., et al.: Visual genome: connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis. 123(1), 32–73 (2017)
Ledig, C., et al.: Photo-realistic single image super-resolution using a generative adversarial network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
Li, C., Wand, M.: Precomputed real-time texture synthesis with Markovian generative adversarial networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 702–716. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_43
Lin, T.Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Liu, M.Y., Breuel, T., Kautz, J.: Unsupervised image-to-image translation networks. In: Advances in Neural Information Processing Systems (2017)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(Nov), 2579–2605 (2008)
Mechrez, R., Talmi, I., Shama, F., Zelnik-Manor, L.: Learning to maintain natural image statistics. arXiv preprint arXiv:1803.04626 (2018)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2017)
Shen, X., Efros, A.A., Aubry, M.: Discovering visual patterns in art collections with spatially-consistent feature learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: Proceedings of the International Conference on Learning Representations (2015)
Tomei, M., Baraldi, L., Cornia, M., Cucchiara, R.: What was Monet seeing while painting? Translating artworks to photo-realistic images. In: Leal-Taixé, L., Roth, S. (eds.) ECCV 2018. LNCS, vol. 11130, pp. 601–616. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-11012-3_46
Tomei, M., Cornia, M., Baraldi, L., Cucchiara, R.: Art2Real: unfolding the reality of artworks via semantically-aware image-to-image translation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019)
Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the International Conference on Computer Vision (2017)
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Tomei, M., Cornia, M., Baraldi, L., Cucchiara, R. (2019). Image-to-Image Translation to Unfold the Reality of Artworks: An Empirical Analysis. In: Ricci, E., Rota Bulò, S., Snoek, C., Lanz, O., Messelodi, S., Sebe, N. (eds) Image Analysis and Processing – ICIAP 2019. ICIAP 2019. Lecture Notes in Computer Science(), vol 11752. Springer, Cham. https://doi.org/10.1007/978-3-030-30645-8_67
DOI: https://doi.org/10.1007/978-3-030-30645-8_67
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30644-1
Online ISBN: 978-3-030-30645-8
eBook Packages: Computer Science (R0)