1 Introduction

Although our society has inherited a vast patrimony of artworks, Computer Vision techniques are usually conceived for realistic images and are rarely applied to visual data from the artistic domain, despite the potential benefits of architectures capable of understanding our cultural heritage.

Since most recent computer vision achievements rely on learning low-level and high-level features from images depicting the real world, it is not straightforward to apply pre-trained architectures to the domain of artworks and paintings, whose texture and low-level features differ from those of the real world [2, 22, 24]. An example of the resulting effect is shown in Fig. 1, where we plot the activations of two pre-trained image classification architectures (VGG-19 [23] and ResNet-101 [8], respectively) on real images and paintings belonging to the categories of landscapes and portraits. As can be seen, even though real and artistic images belong to the same semantic classes, their predicted feature distributions remain separate, clearly highlighting the difficulty the two pre-trained networks have in dealing with artistic data.

Fig. 1.

Domain shift visualization between paintings and real images when applying existing computer vision models. Visualization is obtained by extracting visual features from both real and artistic images (last layer before classification) and by running the t-SNE algorithm [19] on top of them to obtain a 2-d visualization. Details on data collection are reported in Sect. 3.1.

It should be noted, on the other hand, that it is not feasible to re-train state-of-the-art architectures on artistic data, as no large annotated datasets exist in the cultural heritage domain. To address this domain-shift problem while still exploiting the knowledge learned by pre-trained architectures, we have recently proposed a pixel-level domain translation architecture [25], which maps paintings to photo-realistic visualizations by generating translated images that look realistic while preserving the semantic content of the painting. The problem is one of unpaired domain translation, as no annotated pairing exists, i.e. photo-realistic visualizations of paintings are rarely available, and when they are, they usually come in limited number. Therefore, the translation is learned by recovering a latent alignment between two unpaired sets: that of paintings and that of real images. The proposed solution is based on a generative cycle-consistent architecture, endowed with multi-scale memory banks which are in charge of memorizing and recovering the details of realistic images in a semantically consistent way. As a result, generated images look more realistic from a qualitative point of view. Moreover, they are closer to real images in the feature space of pre-trained architectures, leading to reduced prediction errors without the need to re-train state-of-the-art approaches.

In this paper, after a brief description of our architecture, we provide additional analyses and experimental results to showcase the effectiveness of our approach. Firstly, we evaluate the quality of the generated images when translating landscapes, portraits and four different artistic styles, in comparison with other state-of-the-art unpaired translation approaches. Further, we investigate the response of pre-trained architectures for classification, detection and semantic segmentation. As the results will show, our approach reduces the prediction entropy and produces images which are close to real images in feature space. Finally, we conduct a qualitative analysis of the reduction in domain shift, by testing with pre-trained detection, segmentation and captioning networks.

2 Semantically-Aware Image-to-Image Translation

In order to make state-of-the-art computer vision techniques suitable for understanding artistic data, we have not proposed a new architecture specific to this kind of data, but have instead adopted a more general solution which adapts the available data to existing methods. The data adaptation approach we follow consists in transforming a painting into a photo-realistic visualization that preserves its content and overall appearance. This is done through generative models [6] equipped with a cycle-consistency constraint [26] and a semantic knowledge of the scene.

2.1 Cycle-Consistency

Early results on translations between paintings and reality were shown by Zhu et al. [26], on a limited number of artistic settings. In a nutshell, their architecture consists of two Generative Adversarial Networks [6], one taking real photos as input and trained to generate fake paintings, and the other taking real paintings as input and trained to generate fake photos. When a new (realistic or artistic) image is synthesized by a generator, it is brought back to its original domain by the other generator, and the resulting distance from the original image becomes the cycle-consistency objective to minimize. Formally, let x be a sample from the artistic domain X, y a sample from the realistic domain Y, and G and F two functions mapping images from X to Y and from Y to X respectively; cycle-consistency imposes that \(F(G(x)) \approx x\) and that \(G(F(y)) \approx y\).
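As a concrete reference, the following minimal PyTorch sketch illustrates the cycle-consistency objective, using the L1 reconstruction distance adopted in [26]; the generator modules and input tensors are placeholders.

```python
import torch.nn.functional as nnf

def cycle_consistency_loss(G, F, x, y):
    """L1 cycle-consistency sketch: F(G(x)) should reconstruct the painting x,
    and G(F(y)) should reconstruct the real photo y."""
    loss_x = nnf.l1_loss(F(G(x)), x)  # painting -> fake photo -> reconstructed painting
    loss_y = nnf.l1_loss(G(F(y)), y)  # photo -> fake painting -> reconstructed photo
    return loss_x + loss_y
```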

Since our objective is to generate realistic images, rather than style-transferred versions of real images, we focused on the first constraint. We noticed, however, that the adversarial objectives and cycle-consistency loss proposed in [26] alone often fail to preserve semantic consistency and to produce realistic details.

2.2 Semantic-Consistency and Realistic Details

Our first exploration regarded the possibility of constraining our baseline to produce photo-realistic details at multiple scales, and not only an overall plausible image. Our main intuition was that realism, at sufficiently small scales, can be obtained from existing real details, recovered from previously extracted patches coming from the realistic domain. Following this line, in a preliminary work [24] we reached better results than the Cycle-GAN baseline. Later, we further improved the realism of the generation by considering patches as members of specific semantic classes and trying to preserve this membership during the generation [25].

Memory Banks. Considering details as fixed-size square patches, we model the distribution of realistic details as a set of memory banks, each containing a number of patches obtained from available real photos (i.e. from domain Y). Each memory bank \(\varvec{B}^c\) contains only RGB patches belonging to a specific semantic class c, as predicted by the weakly-supervised model by Hu et al. [10], leading to as many memory banks as the number of different classes found in Y, plus a background class. Patches are extracted in a sliding window manner, with specific sizes and strides.
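The sketch below shows one possible way to build such class-specific memory banks, assuming semantic masks have already been predicted for the real photos; function names are ours, and assigning each patch to the class covering the majority of its pixels is an implementation assumption rather than a detail stated above.

```python
import torch
import torch.nn.functional as nnf

def build_memory_banks(real_images, masks, patch_size=8, stride=5):
    """Sketch: extract RGB patches from real photos in a sliding-window fashion and
    group them by semantic class into memory banks (class 0 acts as background).

    real_images: list of (3, H, W) tensors; masks: list of (H, W) integer class maps.
    """
    banks = {}  # class id -> list of flattened patches
    for img, mask in zip(real_images, masks):
        # unfold the image into (num_patches, 3 * patch_size * patch_size) rows
        patches = nnf.unfold(img.unsqueeze(0), kernel_size=patch_size, stride=stride)
        patches = patches.squeeze(0).t()
        # unfold the mask the same way to find each patch's dominant class
        m = nnf.unfold(mask.unsqueeze(0).unsqueeze(0).float(),
                       kernel_size=patch_size, stride=stride).squeeze(0).t().long()
        dominant = m.mode(dim=1).values
        for c in dominant.unique().tolist():
            banks.setdefault(c, []).append(patches[dominant == c])
    return {c: torch.cat(p, dim=0) for c, p in banks.items()}
```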

Since we want the semantic content of an image to be the same before and after the generation, we also need to keep the semantic segmentation masks of source images, i.e. images coming from domain X. In the following, a mask of class c, from source image x, will be denoted as \(\varvec{M}^c_x\).

Semantically-Consistent Generation. In order to make the generator G(x) aware of the semantic content of its input artistic image, we exploit the masks \(\varvec{M}^c_x\). They let us split the content of the source image x (and therefore of its translation G(x)) according to the semantic classes composing the scene. During training, when a translated image G(x) is generated, each of its regions belonging to a specific class is split into patches as well. We developed a matching strategy to pair generated patches of class c with their most similar real patches belonging to memory bank \(\varvec{B}^c\), and we adopted the contextual loss [20] to maximize this similarity. Since the goal of our work is to enhance the performance of existing architectures on artistic data, the exploitation of semantic masks computed on paintings would create a chicken-and-egg problem. To overcome this limitation, we regularly update the masks computed on the painting x, \(\varvec{M}^c_x\), with masks computed on the generated image G(x), \(\varvec{M}^c_{G(x)}\), as the training proceeds.
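A possible implementation of this update schedule, using the frequency reported in Sect. 3.2 (every 20 epochs, starting from epoch 40), is sketched below; the `segmenter` wrapper around the weakly-supervised model of [10] is hypothetical.

```python
import torch

def maybe_update_masks(epoch, paintings, masks, G, segmenter, start_epoch=40, every=20):
    """Sketch: from `start_epoch` onwards, every `every` epochs, replace the painting
    masks M_x^c with the masks M_{G(x)}^c predicted on the current translations."""
    if epoch >= start_epoch and (epoch - start_epoch) % every == 0:
        with torch.no_grad():
            for i, x in enumerate(paintings):
                masks[i] = segmenter(G(x.unsqueeze(0)))  # hypothetical segmentation call
    return masks
```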

Patch-Similarity Driven Generation. Let \(\varvec{K}^c\) be the set of generated patches from regions of G(x) belonging to class c; we compute the cosine similarity between all patches in \(\varvec{K}^c\) and all patches in \(\varvec{B}^c\) and apply a row-wise softmax normalization to the pairwise similarity matrix. The result is an affinity matrix \(\varvec{A}_{ij}^c\), where i indexes \(\varvec{K}^c\) and j indexes \(\varvec{B}^c\). Repeating this operation for each mask found in G(x), we obtain a number of affinity matrices equal to the number of semantic classes in G(x). The contextual loss [20] is in charge of minimizing the distance between pairs of similar patches:

$$\begin{aligned} \mathcal {L}_{CX}^c(\varvec{K}^c, \varvec{B}^c) = -\log \left( \frac{1}{N_K^c}\left( \sum _i \max _j \varvec{A}_{ij}^c\right) \right) , \end{aligned}$$
(1)

with \(N_K^c\) denoting the cardinality of \(\varvec{K}^c\). The complete contextual objective is the summation of Eq. 1 computed for each class c found in G(x), i.e. with different affinity matrices \(\varvec{A}_{ij}^c\):

$$\begin{aligned} \mathcal {L}_{CX}(\varvec{K}, \varvec{B}) = \sum _{c}-\log \left( \frac{1}{N_K^c}\left( \sum _i \max _j \varvec{A}_{ij}^c\right) \right) . \end{aligned}$$
(2)
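The following PyTorch sketch mirrors Eqs. 1 and 2: cosine similarities between generated and memory-bank patches are normalized row-wise with a softmax to form the affinity matrix, and the loss rewards each generated patch for having at least one highly affine real patch. It is an illustrative sketch, not the exact implementation.

```python
import torch
import torch.nn.functional as nnf

def contextual_loss(K_c, B_c):
    """Per-class contextual loss of Eq. 1: K_c holds the generated patches of class c
    (N_K x d) and B_c the real memory-bank patches of the same class (N_B x d)."""
    # pairwise cosine similarities between generated and real patches
    sim = nnf.normalize(K_c, dim=1) @ nnf.normalize(B_c, dim=1).t()  # (N_K, N_B)
    # row-wise softmax turns the similarity matrix into the affinity matrix A^c
    A = nnf.softmax(sim, dim=1)
    # -log of the mean, over generated patches, of their best affinity to a real patch
    return -torch.log(A.max(dim=1).values.mean())

def total_contextual_loss(K_by_class, B_by_class):
    """Eq. 2: sum of the per-class losses over the classes present in G(x)."""
    return sum(contextual_loss(K_by_class[c], B_by_class[c]) for c in K_by_class)
```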

All the previously discussed operations are repeated considering patches extracted with different size and stride values, using scale-specific memory banks and leading to scale-specific affinity matrices. The overall multi-scale contextual loss is the sum of the scale-specific contextual losses:

$$\begin{aligned} \mathcal {L}_{CXMS}(\varvec{K}, \varvec{B}) = \sum _s\mathcal {L}_{CX}^s(\varvec{K}, \varvec{B}). \end{aligned}$$
(3)

Our final loss is the composition of adversarial, cycle-consistent and contextual losses, as follows:

$$\begin{aligned} \begin{aligned} \mathcal {L}(G, F, D_X, D_Y, \varvec{K}, \varvec{B})&=\mathcal {L}_{GAN}(G, D_Y, X, Y) + \mathcal {L}_{GAN}(F, D_X, Y, X)\\&+\,\mathcal {L}_{CYC}(G, F) + \lambda \mathcal {L}_{CXMS}(\varvec{K}, \varvec{B}) \end{aligned} \end{aligned}$$
(4)

where \(\mathcal {L}_{GAN}\) and \(\mathcal {L}_{CYC}\) are, respectively, the adversarial and cycle-consistency losses mentioned in Sect. 2.1, and \(\lambda \) controls the contextual loss importance.

Table 1. Evaluation in terms of Kernel Inception Distance \(\times \,100 \pm \mathrm{std.} \times 100\) [3]. Note that results on style-specific settings are obtained from models trained on the generic landscape setting.

3 Experimental Evaluation

3.1 Datasets

Our artistic datasets all come from Wikiart. Besides generic landscape artworks, we also collected four sets of paintings from different artistic styles (i.e. expressionism, impressionism, realism, and romanticism). To validate our model under a different setting, we used a set of generic portraits as an additional dataset. The model was trained using two sets of real images, one depicting real landscapes and the other real people. The sizes of the considered sets are: landscape paintings, 2044; portraits, 1714; expressionism, 145; impressionism, 852; realism, 310; romanticism, 256; real landscape photographs, 2048; real people photographs, 2048. Due to the limited size of the style-specific sets of paintings, we only used them to validate the generalization capabilities of our model on unseen landscape images.

3.2 Implementation Details

Our generative networks are inspired by Johnson et al. [12], with two stride-2 convolutions, several residual blocks and two stride-1/2 convolutions. Our discriminators are PatchGANs [11, 15, 16]. Memory bank patches were obtained from the two sets of real images (i.e. real landscape photographs and real people photographs). Painting masks were updated with the masks of the generated images every 20 epochs, starting from epoch 40. Three patch scales were adopted for the multi-scale version of the model: \(4\times 4\) with stride 4, \(8\times 8\) with stride 5 and \(16\times 16\) with stride 6. The value chosen for \(\lambda \) in Eq. 4 was 0.1. Weights were initialized from a Gaussian distribution with zero mean and standard deviation 0.02. We trained our model for 300 epochs using the Adam optimizer [13] with a batch size of 1. A constant learning rate of 0.0002 was used for the first 100 epochs, followed by a linear decay to zero over the next 200 epochs. To reduce training time, an early stopping criterion was adopted: if the Fréchet Inception Distance [9] did not decrease for 30 consecutive epochs, the training was stopped.
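The optimizer, learning-rate schedule and weight initialization above can be summarized with the following sketch (the Adam betas are left at their defaults since they are not specified here; function names are placeholders):

```python
import torch

def make_optimizer_and_scheduler(generators, n_epochs=300, n_constant=100):
    """Adam with lr 2e-4, constant for the first 100 epochs, then linearly decayed
    to zero over the remaining 200 epochs."""
    params = [p for g in generators for p in g.parameters()]
    optimizer = torch.optim.Adam(params, lr=2e-4)  # betas left at defaults (an assumption)
    n_decay = n_epochs - n_constant
    schedule = lambda epoch: 1.0 - max(0, epoch - n_constant) / float(n_decay)
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=schedule)
    return optimizer, scheduler

def init_weights(module):
    """Gaussian initialization with zero mean and standard deviation 0.02."""
    if isinstance(module, (torch.nn.Conv2d, torch.nn.ConvTranspose2d)):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
```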

Table 2. Mean entropy values of images generated through our method and competitors. Results are reported for different computer vision tasks, i.e. classification (VGG-19, ResNet-101), segmentation (Mask\(^X\) R-CNN), and detection (Faster R-CNN).

3.3 Visual Quality Evaluation

A quantitative evaluation of the realism of the images generated by our method can be performed through a similarity measure between the Inception representations of fake images and of samples from the target distribution. We adopt the Kernel Inception Distance (KID) [3], which measures the squared Maximum Mean Discrepancy between Inception representations. Compared to the Fréchet Inception Distance [9], the KID metric is more reliable, especially when it is computed over fewer test images than the dimensionality of the Inception features. Table 1 shows the KID values computed between the representations of generated and real images, for different settings. Following the original paper [3], the final KID values were averaged over 100 different splits of size 100, randomly sampled from each setting. As can be seen, our semantic-aware architecture is able to lower the KID in almost all the settings. Our KID values are compared with those of Cycle-GAN [26] and UNIT [18], which we trained on the datasets discussed in Sect. 3.1 using the original authors' implementations. The style-transferred reals row reports the KID values of images obtained through the method of Gatys et al. [4], considering real photos as content images and randomly sampled paintings (from a specific artistic setting) as style images. The style-specific columns of Table 1 report KID values on expressionism, impressionism, realism and romanticism computed using the models trained on generic landscapes.
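For reference, the sketch below is consistent with the KID protocol of [3]: an unbiased squared-MMD estimate with a cubic polynomial kernel over Inception features, averaged over 100 random splits of 100 images each. It is a simplified illustration, not the exact implementation used for Table 1.

```python
import numpy as np

def polynomial_kernel(X, Y, degree=3):
    # cubic polynomial kernel k(x, y) = (x.y / d + 1)^3 over d-dimensional features
    return (X @ Y.T / X.shape[1] + 1.0) ** degree

def kid(feats_fake, feats_real):
    """Unbiased squared-MMD estimate for one split of (n, d) Inception features."""
    m, n = len(feats_fake), len(feats_real)
    k_xx = polynomial_kernel(feats_fake, feats_fake)
    k_yy = polynomial_kernel(feats_real, feats_real)
    k_xy = polynomial_kernel(feats_fake, feats_real)
    # diagonal terms are removed for the unbiased estimator
    return ((k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
            + (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
            - 2.0 * k_xy.mean())

def kid_over_splits(feats_fake, feats_real, n_splits=100, split_size=100, seed=0):
    """Average KID (and its std) over random splits, as in [3]."""
    rng = np.random.default_rng(seed)
    vals = [kid(feats_fake[rng.choice(len(feats_fake), split_size, replace=False)],
                feats_real[rng.choice(len(feats_real), split_size, replace=False)])
            for _ in range(n_splits)]
    return np.mean(vals), np.std(vals)
```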

Fig. 2.

Distribution of different types of features extracted from landscape and portrait images. We report feature distributions coming from VGG-19 and ResNet-101 for classification, Faster R-CNN for detection, and Gram matrices which encode image textures.

3.4 Entropy Analysis

The analysis of the output probabilities of a model can be helpful to evaluate its level of uncertainty about its input. Specifically, we can compute the entropy of a model's output probabilities on a given image. By averaging the entropy values computed on all the images of a given setting, we can determine how uncertain a model is about its scores on that setting: the higher the entropy, the higher the level of uncertainty. Table 2 shows the average entropy values of different existing models on original paintings, real photos and images generated by our model and by competitors. As can be noticed, our model leads to the lowest mean entropy in all the considered tasks, i.e. classification (VGG-19 [23], ResNet-101 [8]), semantic segmentation (Mask\(^X\) R-CNN [10]) and detection (Faster R-CNN [21]). The entropy was computed by averaging image entropy for classification, pixel entropy for segmentation and bounding-box entropy for detection, on the landscapes and portraits settings.
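The entropy computation can be summarized as follows (a sketch for the classification case; for segmentation and detection the same quantity is averaged over pixels and bounding boxes, respectively):

```python
import torch
import torch.nn.functional as nnf

def prediction_entropy(logits):
    """Entropy of a softmax prediction (higher means more uncertain)."""
    p = nnf.softmax(logits, dim=-1)
    return -(p * torch.log(p + 1e-12)).sum()

def mean_entropy(model, images):
    """Average classification entropy of a model over a set of images; the model is
    assumed to return un-normalized class scores (logits)."""
    with torch.no_grad():
        entropies = [prediction_entropy(model(img.unsqueeze(0)).squeeze(0)) for img in images]
    return torch.stack(entropies).mean()
```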

Fig. 3.

Segmentation results on original portraits and their translated versions. Our method leads to improved segmentation performance of existing models on artistic data.

3.5 Feature Distributions Visualization

As mentioned in Sect. 1, there is a strong domain gap between real images and paintings, especially when considering the distributions of features coming from a CNN. To verify the reduction of this domain gap, Fig. 2 shows the distributions of different types of features extracted from images generated by our model, their artistic versions, and real images. We compare feature distributions coming from two classification models (i.e. VGG-19 [23], ResNet-101 [8]) and from an object detection network (i.e. Faster R-CNN [21]). We also include feature distributions representing Gram matrices [5], which encode image styles and textures. To represent each image, we extracted a visual feature vector coming either from the fc7 layer of a VGG-19 or from the average pooling layer of a ResNet-101. In the case of the detection network, we extracted a set of feature vectors from a Faster R-CNN trained on Visual Genome [14], representing the detected image regions, which were averaged to obtain a single visual descriptor for each image. To compute Gram matrices, we extracted features from the fc3 layer of a VGG-19. Given these n-dimensional representations of each image (with n equal to 2048 for ResNet-101 and Faster R-CNN, and 4096 for VGG-19 and the Gram matrices), we projected them into a 2-dimensional space using the t-SNE algorithm [19]. As can be seen, the distributions of our generated images are closer to the distributions of real images than to those of paintings, confirming the reduction of the domain shift between real and artistic images in almost all considered settings.
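The projection step can be reproduced with a few lines of scikit-learn, as in the sketch below; the t-SNE hyper-parameters are left at their defaults and are not necessarily those used for Fig. 2.

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_feature_distributions(features, labels):
    """Project n-dimensional image descriptors to 2-D with t-SNE [19] and plot them,
    one colour per group (e.g. paintings, generated images, real photos).

    features: (num_images, n) array; labels: list of group names, one per image."""
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    for group in sorted(set(labels)):
        idx = [i for i, l in enumerate(labels) if l == group]
        plt.scatter(emb[idx, 0], emb[idx, 1], s=8, label=group)
    plt.legend()
    plt.show()
```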

Fig. 4.

Detection results on original paintings and their translated versions. Our model leads to improved results of existing detection models on the artistic domain.

Fig. 5.

Image captioning results on original paintings and their realistic versions generated by our model. Textual descriptions of realistic images are in general more detailed and consistent with the subject depicted in the scene.

4 Reducing the Domain Shift: A Qualitative Analysis

The scarcity of annotated artistic datasets does not allow the use of standard quantitative evaluation metrics for computer vision models on our data. We can numerically assess the quality of the generation, but we cannot systematically evaluate whether, for example, a pre-trained segmentation model works better on our generated images than on the original paintings. For this reason we show, through a number of qualitative examples, that a fake-realistic image generated by our architecture is easily understandable by state-of-the-art models, unlike its original painted version. Figure 3 shows painting-generated image pairs which are both given as input to a Mask R-CNN [7] pre-trained on COCO [17]: besides improving the score of well-labeled masks, we are also able to reduce the number of false positives (top-left and bottom-right) and false negatives (bottom-left). Figure 4 illustrates the bounding boxes predicted by a Faster R-CNN [21] pre-trained on Visual Genome [14]: again we obtain improved results, detecting actual clouds instead of pillows (top-right) or actual sky instead of water (top-left and middle-left). Finally, Fig. 5 presents sentences generated by the captioning approach of [1] on paintings and on the fake generated photos. As can be observed, textual descriptions become more accurate and aligned with the depicted scene after applying our translation approach. We also observe a reduction in the number of hallucinations (e.g. a boat in the middle-left example, a dog in the bottom-left example). These observations justify and motivate our work, which is an attempt to extend the computer vision field to the still unexplored artistic domain.

5 Conclusion

We have presented an unpaired image-to-image translation approach which can translate paintings into photo-realistic visualizations. Our work is motivated by the poor performance of pre-trained architectures on artistic data, and by the need for Computer Vision pipelines capable of understanding cultural heritage. The presented approach is based on a cycle-consistent translation framework endowed with multi-scale memory banks of patches, so that generated patches are constrained to be similar to real ones. Further, it also includes a semantic-aware strategy so as to impose the semantic correctness of generated patches. In this paper, we have conducted additional experiments and evaluations: firstly, we have assessed the visual quality of generated images, in the case of landscapes, portraits and paintings from different styles. Further, we have investigated the response of pre-trained architectures in terms of prediction entropy and feature distributions. The results have confirmed that our approach is able to generate images which look realistic both from a qualitative point of view and in terms of the predictions given by pre-trained architectures. Finally, as an additional contribution, we have presented some qualitative predictions given by detection, segmentation and captioning networks on images generated by our approach.