1 Introduction

Most learning-based models rely on the assumption that training and test data are drawn from the same distribution. Unfortunately, this assumption does not hold in many real-world computer vision applications due to unpredictable changes in the environment (e.g. weather, illumination, occlusions). Despite the progress brought to visual recognition by deep learning and the availability of large, fully annotated datasets, models trained on a distribution different from the test one still struggle to generalize and perform poorly. This problem is commonly referred to as domain shift, and it is especially relevant in computer vision due to the large appearance variability of visual data. The domain shift problem has been widely studied in the last decade and many techniques have been proposed to limit its effect [29].

Domain Adaptation (DA) methods specifically focus on learning a given task in a source domain, then transferring the acquired knowledge to the domain of interest, i.e. the target domain. In past years, researchers studied the theoretical aspects of this problem [1, 28] and proposed several shallow [14] and deep learning based [4, 10, 21, 22] algorithms. However, recent studies [7] have shown that the domain shift problem can only be alleviated, not entirely solved, even when adopting deep architectures.

Often, DA methods consider a single-source, single-target scenario, but a setting in which multiple source domains are available is arguably more interesting and realistic. In fact, datasets could contain images taken with different cameras, from many viewpoints or under different lighting conditions. Approaching such cases with single-source DA algorithms will lead to poor results. For this reason, many DA methods have been proposed to learn from multiple sources [3, 8, 25, 28, 35]. However, these approaches assume that the domain label of each sample is known. A more challenging scenario arises when the domain to which a sample belongs is not known in advance. This problem, also referred to as latent domain discovery, considers the presence of multiple but mixed source and/or target domains, with either partial or no information about the ground-truth domain of each sample. In previous years, few works [11, 13, 27, 36] focused on this setting, simultaneously performing the discovery of latent domains and using this information to learn a classification model for the target one.

In this paper, we propose a novel formulation for the domain discovery algorithm proposed in [27]. In particular, we enhance the domain classifier training by employing a different objective. This objective is based on (i) producing multiple domain predictions on perturbations of the features of a given sample and (ii) applying on those predictions the recently proposed Min-Entropy Consensus (MEC) Loss [30]. This loss enforces both consistency and low entropy for the perturbed domain predictions of a single sample. An overview of the method is reported in Fig. 1. Our empirical study demonstrates that we are able to extract meaningful latent domains from the source samples, achieving better performance than previous latent domain discovery DA methods on popular benchmarks, such as Office-31 [32] and PACS [18].

2 Related Works

Deep Domain Adaptation. In recent years, deep learning based DA approaches have been shown to be very effective in addressing this task. Usually, robust domain-invariant features are learned in deep architectures using either supervised neural networks [4, 10, 21, 30] or deep autoencoders [39]. Some methods [21, 22] rely on the idea of aligning source and target features by minimizing the Maximum Mean Discrepancy (MMD). A different approach is represented by methods that operate in a domain-adversarial setting [10], i.e. they focus on learning a domain-agnostic feature space by minimizing a domain confusion loss. Recent works have also explored the use of generative models [2, 31]. Our work is close to recent trends exploring the use of domain-specific batch-normalization layers [4, 5, 25, 26], since we use a variant of those layers [27] to adapt the model from the latent source domains to the target one. Our approach is also linked to consistency-based DA strategies [9, 30, 33]. Differently from these works, we employ a consistency loss [30] for learning the domain prediction branch. Our work is also related to multi-source DA [8, 35, 37] and domain generalization [3, 6, 18, 23, 24]. Similarly to these scenarios, we assume the presence of multiple source domains. However, in our case these domains are mixed and we must discover them in order to exploit the advantages of multi-source DA approaches.

Latent Domain Discovery for DA. Very few works in the literature have tried to address the latent domain discovery problem. While previous works on shallow features considered the use of multiple Gaussian distributions [13], domain distinctiveness [11], exemplar SVMs [19, 38] and manifold learning [36], only one work addressed this problem in the context of deep DA [27]. In [27] we proposed to exploit a domain prediction branch and domain alignment layers [4, 5] to discover latent domains and improve the DA performance on the target domain. While in [27] the domain prediction branch was trained through an entropy loss, in this work we show how we can achieve similar or better results by employing a different loss [30], which encourages both low entropy and consistency on the domain predictions for perturbations of the same input features.

3 Method

3.1 Problem Formulation and Notation

As in standard Unsupervised Domain Adaptation (UDA), we assume to have access to a source and a target domain. The source domain contains semantically labeled samples, while the target domain contains only unlabeled samples. However, differently from standard UDA, we assume that the source domain is composed of a mixture of multiple domains and, contrary to multi-source DA, we do not assume to know to which domain each source sample belongs. Following previous works [27], we assume to have \(\mathsf {k}\) source domains. Notice that this number might not be known a priori: in our current formulation we leave it as a hyperparameter. Source domains are characterized by unknown probability distributions \(p_{\mathtt {xy}}^{s_1},\dots ,p_{\mathtt {xy}}^{s_\mathsf {k}}\) defined over \(\mathcal {X}\times \mathcal {Y}\), where \(\mathcal {X}\) is the input space (e.g. images in our case) and \(\mathcal {Y}\) the output space (e.g. object categories). The source data are thus modelled as a set \(\mathcal {S}=\{(x_1^s,y_1^s),\dots ,(x_\mathsf {n}^s,y_\mathsf {n}^s)\}\) with \(x_\mathcal {S}=\{x_1^s,\dots ,x_\mathsf {n}^s\}\) and \(y_\mathcal {S}=\{y_1^s,\dots ,y_\mathsf {n}^s\}\), the source data and label sets, respectively. The set \(\mathcal {S}\) contains i.i.d. observations from a mixture distribution \(p_{\mathtt {xy}}^s=\sum _{i=1}^\mathsf {k} \pi _{s_i} p_{\mathtt {xy}}^{s_i}\), where \(\pi _{s_i}\) is the unknown probability of sampling from a source domain \(s_i\). Similarly, we assume to have target domain data \(\mathcal {T}=\{x_1^t,\dots ,x_\mathsf {m}^t\}\) of i.i.d. observations drawn from \(p_\mathtt {x}^t\).

During training we receive semantically labeled source samples with unknown domain membership plus unlabeled target samples. Our goal is to learn a model able to address a given task (i.e. classification) in the target domain. Following [27], we address this task by using domain specific batch-normalization [15] (BN) layers to perform DA [4, 5, 20, 26]. These layers are influenced by the latent domain discovery process, performed by a domain prediction branch. With respect to [27] we propose a new objective for the domain prediction branch. In the following we will review how BN can be used to address DA [4, 5, 26] and how a simple variant can be used in the case where we have multiple but unknown source domains [27]. We will then describe how the domain assignment branch can be trained by using the Min-Entropy Consensus loss [30] (Sect. 3.3), building the whole objective for the training procedure.

Fig. 1. Schematic representation of our method applied to the AlexNet architecture (left) and of an mDA-layer (right). The features fed to the domain classifier are perturbed through Dropout [34]. The Min-Entropy Consensus (MEC) loss is then applied to the output of the domain classifier, to enforce the same domain assignment for different perturbations of the same input.

3.2 Multi-domain DA-Layers

BN-based DA methods [4, 5, 20] are a simple yet effective way to tackle the DA problem. Since features extracted by a neural network tend to follow domain-dependent distributions [20], we can align them through domain specific normalization layers. Following [27], let us denote as \(q^d_\mathtt {x}\) the distribution of activations for a given feature channel and domain d. Domain Alignment Layers [4, 5] (\({{\,\mathrm{DAL}\,}}\)) normalize an input \(x^d\sim q^d_\mathtt {x}\) according to

$$\begin{aligned} {{\,\mathrm{DAL}\,}}(x^d; \mu _d, \sigma _d) = \frac{x^d - \mu _d}{\sqrt{\sigma _d^2 + \epsilon }}, \end{aligned}$$
(1)

where \(\mu _d = {{\,\mathrm{E}\,}}_{x\sim q^d_\mathtt {x}}[x]\), \(\sigma ^2_d = {{\,\mathrm{Var}\,}}_{x\sim q^d_\mathtt {x}}[x]\) are mean and variance of the input distribution, respectively, and \(\epsilon >0\) is a small constant to avoid numerical issues. During training the statistics \(\{\mu _d,\sigma ^2_d\}\) are computed over the current mini-batch, thus we apply standard BN but separately for each available d.
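As an illustration of Eq. (1), the following is a minimal NumPy sketch of domain-specific normalization on a mini-batch. The function name `dal`, the toy data and the value of `eps` are assumptions made for this example and do not come from the original implementation; learnable scale and bias parameters are omitted.

```python
import numpy as np

def dal(x, domain_ids, eps=1e-5):
    """Normalize each feature channel separately for each domain (Eq. 1).

    x: (batch, channels) activations; domain_ids: (batch,) integer domain labels.
    """
    out = np.empty_like(x, dtype=np.float64)
    for d in np.unique(domain_ids):
        mask = domain_ids == d
        mu = x[mask].mean(axis=0)    # per-channel mini-batch mean of domain d
        var = x[mask].var(axis=0)    # per-channel (biased) variance of domain d
        out[mask] = (x[mask] - mu) / np.sqrt(var + eps)
    return out

rng = np.random.default_rng(0)
# Two toy domains whose activations differ in mean and scale.
x = np.concatenate([rng.normal(0.0, 1.0, (64, 4)), rng.normal(5.0, 3.0, (64, 4))])
domain_ids = np.repeat([0, 1], 64)
z = dal(x, domain_ids)
```

After this step, the activations of both toy domains have approximately zero mean and unit variance per channel, which is the alignment effect the layer relies on.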

The previous formulation requires full domain knowledge (i.e. d) for each sample, something that we do not have in our setting for the source domain. In [27] a variant of the \({{\,\mathrm{DAL}\,}}\) layers called Multi-Domain Alignment Layers (\({{\,\mathrm{mDA}\,}}\)) has been proposed to tackle this issue. \({{\,\mathrm{mDA}\,}}\) layers exploit the probabilities that a source sample belongs to one of the latent domains. Formally, denoting as \(w_{i,d}\) the probability of \(x_i\) belonging to d and a source mini-batch \(\mathcal {B}=\{x_i\}_{i=1}^\mathsf {b}\), \({{\,\mathrm{mDA}\,}}\) layers normalize \(x_i\) as follows:

$$\begin{aligned} {{\,\mathrm{mDA}\,}}(x_i, \varvec{w}_i; \varvec{\hat{\mu }}, \varvec{\hat{\sigma }}) = \sum _{d\in \mathcal {D}} w_{i,d} \frac{x_i - \hat{\mu }_d}{\sqrt{\hat{\sigma }_d^2 + \epsilon }}, \end{aligned}$$
(2)

where \(\varvec{w}_i=\{w_{i,d}\}_{d\in \mathcal {D}}\), \(\varvec{\hat{\mu }}=\{\hat{\mu }_d\}_{d\in \mathcal {D}}\), \(\varvec{\hat{\sigma }}=\{\hat{\sigma }^2_d\}_{d\in \mathcal {D}}\) and \(\mathcal {D}\) is the set of source latent domains. Notice that \(\hat{\mu }_d\) and \(\hat{\sigma }_d^2\) are computed in a weighted fashion:

$$\begin{aligned} \begin{aligned} \hat{\mu }_d&= \sum _{i=1}^\mathsf {b} \alpha _{i,d} x_i,&\hat{\sigma }_d^2&= \sum _{i=1}^\mathsf {b} \alpha _{i,d} (x_i - \hat{\mu }_d)^2, \;\;\;\text {with}\;\;\;\alpha _{i,d} = \frac{w_{i,d}}{\sum _{j=1}^\mathsf {b} w_{j,d}}. \end{aligned} \end{aligned}$$
(3)

Equation (2) is used to normalize source samples in our setting, where the domain of each sample is not known a priori. For the target domain we can directly use (1), although the formulation in (2) can easily be extended to the case where the target is also a mixture of multiple datasets.
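The weighted normalization of Eqs. (2) and (3) can be sketched in NumPy as follows. The function name `mda`, the value of `eps` and the toy inputs are assumptions for illustration only. The sketch also assumes that every latent domain receives non-zero total weight in the mini-batch, since otherwise the weights in Eq. (3) are undefined.

```python
import numpy as np

def mda(x, w, eps=1e-5):
    """Multi-domain alignment of a source mini-batch (Eqs. 2-3).

    x: (b, c) activations; w: (b, k) domain probabilities, each row sums to 1.
    """
    alpha = w / w.sum(axis=0, keepdims=True)          # (b, k), each column sums to 1
    mu = alpha.T @ x                                  # (k, c) weighted means
    diff = x[:, None, :] - mu[None, :, :]             # (b, k, c) deviations per domain
    var = np.einsum("bk,bkc->kc", alpha, diff ** 2)   # (k, c) weighted variances
    normed = diff / np.sqrt(var + eps)[None, :, :]    # (b, k, c) per-domain normalization
    return np.einsum("bk,bkc->bc", w, normed)         # mix the k outputs with w_{i,d}

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
w_hard = np.eye(2)[np.repeat([0, 1], 4)]   # one-hot assignments: 4 samples per domain
out = mda(x, w_hard)
```

With one-hot assignments, as in the toy call above, the layer reduces to the per-domain normalization of Eq. (1). With soft assignments every sample contributes to the statistics of every latent domain in proportion to its probability.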

3.3 Min-Entropy Consensus Loss for Domain Prediction

A crucial aspect of \({{\,\mathrm{mDA}\,}}\) layers is the domain assignment \(\varvec{w}_{i}\) that each sample receives. To this end, as in [27] we employ a domain prediction branch. This branch is composed of a minimal set of layers followed by a softmax operation on \(\mathsf {k}\) outputs. It is a separate section of the network which shares with the classification part only the bottom-most layers, due to their higher domain specificity [27]. In [27] the domain prediction branch is trained by exploiting an entropy loss. In this work, we argue that we can train a more effective domain prediction branch if we enforce the entropy loss through consensus among domain assignments for perturbations of the same input.

Formally, let us define as \(g^\theta \) the domain prediction branch, parametrized by \(\theta \). We split it into two parts: \(g_{E}^\theta \) and \(g_{D}^\theta \), denoting the feature extractor and the domain classifier respectively. Given the low-level features \(x_i\), in [27] the domain prediction branch produces the domain assignments \(\varvec{w}_{i}\) as follows:

$$\begin{aligned} \varvec{w}_i = g^\theta (x_i) = g_{D}^\theta (g_{E}^\theta (x_i)) \end{aligned}$$
(4)

In order to obtain multiple assignments for perturbed versions of the input, we employ a non-parametric random transformation \(\phi \). The assignment of the perturbed sample is obtained by replacing the feature extraction function \(g_E^\theta \) with \(\phi \circ g_E^\theta \):

$$\begin{aligned} \varvec{\hat{w}}_i = g_{D}^\theta (\phi (g_{E}^\theta (x_i))) \end{aligned}$$
(5)

where \(\varvec{\hat{w}}_i\) denotes the assignment given to the perturbed features. Since \(\phi \) is random, applying this function multiple times on the same input will produce different outputs. With this in mind we can create a matrix \(\varvec{\hat{W}}_i = [\varvec{\hat{w}}_i^1, \cdots , \varvec{\hat{w}}_i^\mathsf {r}]\) where each element \(\varvec{\hat{w}}_i^j\) is obtained by classifying with \(g_D^\theta \) a different application of \(\phi \) on the features extracted by \(g_E^\theta \).

Since \(\varvec{\hat{W}}_i\) is a set of \(\mathsf {r}\) predictions related to different perturbations of the same sample, we can enforce consistency within \(\varvec{\hat{W}}_i\), obtaining an unsupervised objective for the domain prediction branch. However, as noted in [30], standard consistency losses [9, 33] only enforce consistent predictions across perturbations of the same sample, without taking into account the actual confidence of the assignment. To this end, we follow [30] and employ the Min-Entropy Consensus (MEC) loss as an objective for the domain classifier. Given a set \(\varvec{\hat{W}}_i = [\varvec{\hat{w}}_i^1, \cdots , \varvec{\hat{w}}_i^\mathsf {r}]\), we minimize the following objective:

$$\begin{aligned} \text {MEC}(x_i) = -\frac{1}{\mathsf {r}}\max _{d\in \mathcal {D}} \sum _{j=1}^{\mathsf {r}} \log (\hat{w}^j_{i,d}) \end{aligned}$$
(6)

The domain loss on the full source set is:

$$\begin{aligned} L_\text {dom}(\theta ) = \frac{1}{\mathsf {n}}\sum _{i=1}^{\mathsf {n}} \text {MEC}(x_i) \end{aligned}$$
(7)

With (7) we have defined a loss which yields domain predictions that are both consistent and confident for a given sample. In the experiments we use Dropout [34] as \(\phi \) with ratio 0.5, setting \(\mathsf {r}=2\) as in [30].
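The computation of Eqs. (5) to (7) can be sketched in NumPy as follows. This is an illustrative sketch rather than the original Caffe implementation: the linear domain classifier `W`, the random stand-in features `h`, the inverted-dropout scaling and the small constant added inside the logarithm are all assumptions of this example. Only the Dropout ratio of 0.5 and \(\mathsf {r}=2\) come from the text.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the row maximum for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dropout(h, rate, rng):
    """Random perturbation phi: zero each feature with probability `rate`."""
    keep = rng.random(h.shape) >= rate
    return h * keep / (1.0 - rate)          # inverted-dropout scaling (assumption)

def mec_loss(w_hat):
    """MEC loss averaged over samples (Eqs. 6-7).

    w_hat: (r, n, k) domain predictions for r perturbations of n samples.
    """
    log_w = np.log(w_hat + 1e-12)           # small constant avoids log(0)
    per_sample = -log_w.sum(axis=0).max(axis=1) / w_hat.shape[0]
    return per_sample.mean()

rng = np.random.default_rng(0)
n, f, k, r = 16, 32, 3, 2
h = rng.normal(size=(n, f))                 # stand-in for the features g_E(x_i)
W = rng.normal(size=(f, k)) * 0.1           # hypothetical linear domain classifier g_D
w_hat = np.stack([softmax(dropout(h, 0.5, rng) @ W) for _ in range(r)])
loss = mec_loss(w_hat)
```

The loss is close to zero only when all perturbed predictions put almost all their mass on the same latent domain. Uniform predictions give \(\log \mathsf {k}\), and confident but disagreeing predictions are penalized heavily, which is how the objective couples low entropy with consistency.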

To train the full architecture we need to define an objective for the semantic classification part. Following [4, 5, 27] we employ a cross-entropy loss on the labeled source samples and an entropy loss for the unlabeled target ones. Denoting as \(f_C^\theta \) the classification branch we have:

$$\begin{aligned} \begin{aligned} L_\text {cls}(\theta )=&- \frac{1}{\mathsf {n}} \sum _{i=1}^\mathsf {n} \log f_C^\theta (y_i^s; x_i^s)+\frac{\lambda _C}{\mathsf {m}} \sum _{i=1}^\mathsf {m} H(f_C^\theta (\cdot ;x_i^t)). \end{aligned} \end{aligned}$$
(8)

The first term on the right-hand-side is the average log-loss related to the supervised examples in \(\mathcal {S}\), where \(f_C^\theta (y_i^s; x_i^s)\) denotes the output of the classification branch of the network for a source sample, i.e. the predicted probability of \(x_i^s\) having class \(y_i^s\). The second term on the right-hand-side of (8) is the entropy H of the classification distribution \(f_C^\theta (\cdot ; x_i^t)\), averaged over all unlabeled target examples \(x_i^t\) in \(\mathcal {T}\), scaled by a positive hyperparameter \(\lambda _C\). The full objective is:

$$\begin{aligned} \begin{aligned} L(\theta ) = L_\text {cls}(\theta )+\lambda _D L_\text {dom}(\theta ), \end{aligned} \end{aligned}$$
(9)

where \(L_\text {cls}\) is a loss term that penalizes based on the final classification task, while \(L_\text {dom}\) accounts for the domain classification task, with a hyperparameter \(\lambda _D\) balancing the two. We highlight that, due to the dependency of the classification branch on the mDA layers, the network learns to predict domain assignment probabilities that also result in a low classification loss. A schematic representation of our architecture is depicted in Fig. 1. Since the semantic classification part needs a single domain assignment for each sample, we set \(\varvec{w}_i\) as the average of the domain predictions on perturbed inputs, i.e. \(\varvec{w}_i = \frac{1}{\mathsf {r}}\sum _{j=1}^{\mathsf {r}} \varvec{\hat{w}}_i^j\).
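To make the combination of the loss terms concrete, the sketch below evaluates Eqs. (8) and (9) on toy probabilities and averages the perturbed domain predictions into a single assignment. All function names and numerical values are assumptions for illustration. In particular, the value used for \(\lambda _C\) and the domain loss value `l_dom` are arbitrary, while \(\lambda _D=0.5\) matches the setting reported in the experiments.

```python
import numpy as np

def cross_entropy(p, y):
    """Average log-loss of predicted probabilities p (n, c) for integer labels y (n,)."""
    return -np.log(p[np.arange(len(y)), y] + 1e-12).mean()

def entropy(p):
    """Average Shannon entropy of the rows of p (m, c)."""
    return -(p * np.log(p + 1e-12)).sum(axis=1).mean()

def full_objective(p_src, y_src, p_tgt, l_dom, lambda_c, lambda_d):
    l_cls = cross_entropy(p_src, y_src) + lambda_c * entropy(p_tgt)   # Eq. (8)
    return l_cls + lambda_d * l_dom                                    # Eq. (9)

def average_assignment(w_hat):
    """Average r perturbed domain predictions (r, n, k) into one assignment per sample."""
    return w_hat.mean(axis=0)

# Toy values, chosen only for illustration.
p_src = np.array([[0.9, 0.1], [0.2, 0.8]])    # source class probabilities
y_src = np.array([0, 1])                      # source labels
p_tgt = np.array([[0.5, 0.5], [0.99, 0.01]])  # target class probabilities
w_hat = np.array([[[0.7, 0.3], [0.1, 0.9]],
                  [[0.9, 0.1], [0.3, 0.7]]])  # r = 2 perturbed domain predictions
w = average_assignment(w_hat)                 # assignments fed to the mDA layers
loss = full_objective(p_src, y_src, p_tgt, l_dom=0.3, lambda_c=0.1, lambda_d=0.5)
```

Because each perturbed prediction is a probability vector, their average is one as well, so it can be used directly as \(\varvec{w}_i\) in Eq. (2).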

4 Experiments

4.1 Experimental Setup

In our evaluation we consider the following benchmarks: the PACS dataset [18] and the Office-31 [32] dataset.

Office-31 is a widely used DA benchmark which contains images of 31 object categories collected from 3 different sources: Webcam (W), DSLR camera (D) and the Amazon website (A). We test our model on the multi-source setting [36], where each domain is in turn considered as target and the others as sources. We use this benchmark to compare with [27] as well as previous shallow algorithms [11, 13, 36]. In this setting we use as input to our algorithm the activations of the \(\mathtt {fc7}\) layer of an AlexNet [17] architecture, applying mDA layers to the features and after the domain classifier, as in [27]. The structure of the domain prediction branch is the same as in [27], except for the addition of a BN layer (without scale and bias) on the domain logits, since we found that this addition stabilizes the training procedure. The hyperparameters used for training are the same as in [27], with \(\lambda _D=0.5\) and \(\mathsf {k}=2\).

PACS [18] is a recently proposed dataset which contains images of 7 categories extracted from 4 different representations with significant domain shift: Photo (P), Art paintings (A), Cartoon (C) and Sketch (S). Following [18], we train our model considering 3 domains as sources and the remaining one as target, using all the images of each domain. Differently from [18], we consider a DA setting (i.e. target data are available at training time). For the experiments on the PACS dataset we consider the ResNet-18 architecture [12]. As in [27], to apply our approach, we replace each BN layer in the network with an mDA-layer. As in the previous case, the structure of the domain prediction branch and the hyperparameters selected for training are the same as in [27], with \(\lambda _D=0.5\) and \(\mathsf {k}=3\), and with the insertion of a BN layer after the domain prediction logits.

We implement all the models with the Caffe [16] framework and our evaluation is performed using an NVIDIA GeForce GTX 1080 GPU. Both architectures have been initialized with weights pretrained on ImageNet: for AlexNet we take the pre-trained model available in Caffe, while for ResNet we use the converted version of the original model developed in Torch.

4.2 Results

Analysis of Our Method. In a first series of experiments we compare our model and [27] on the PACS dataset using the ResNet-18 architecture. As baselines we report the performance of the base architecture, the single-source DA model of [5] (DIAL) and the multi-source version of [5], which is our upper bound since it assumes perfect domain separation (Multi-source DA). The results are shown in Table 1. As the table shows, our model achieves performance comparable to [27] on average. By analyzing the results it is possible to see that our model performs comparably to the Multi-source DA upper bound in the domains where the gap with the single-source baseline is minimal (Photo and Art). Moreover, our model largely outperforms [27] when Sketch is used as target. We ascribe this behaviour to the fact that enforcing consistency regularizes and strengthens the latent domain discovery process, providing favourable domain separation even when the difference among the domains is less pronounced (as in this case, where Photo, Art and Cartoon are the source domains). At the same time, this regularization could harm the confidence of the domain prediction branch and the statistics estimated by Eq. 2 if the source domains are close. This happens for instance when Cartoon is employed as target, where two source domains (Photo and Art) are close to each other and far from the third one (Sketch).

Table 1. PACS dataset: comparison of different methods using the ResNet architecture. The first row indicates the target domain, while all the others are considered as sources.

To understand the outcome of the latent domain discovery process, we report histograms analyzing how many samples of a domain receive a given probability of belonging to a latent domain. The analysis is shown in Fig. 2. As the figure shows, every time Sketch is among the source domains (yellow bar) almost all its samples are assigned to a single latent domain. Moreover, when Sketch is present, since the difference among the other source domains is more subtle, they tend to receive assignments spread among the other two latent domains, even if with different distributions. This is clear in Fig. 2b, where Photo samples tend to be assigned to the first latent domain and Cartoon samples to the second one. Similarly, in the case where Sketch is the target (Fig. 2d), Cartoon samples are assigned to the first latent domain, with Photo samples mainly assigned to the second, and Art samples spread among the three latent domains. This latter outcome is reasonable, since Art is a domain which is visually intermediate between Photo and Cartoon. A similar effect can be noted when Cartoon and Sketch are both source domains: since Cartoon is the closest visual domain to Sketch, its samples may receive non-negligible probabilities even for the latent domain to which Sketch samples are assigned.
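An analysis of this kind can be sketched as follows: for each ground-truth source domain, and for each latent domain, count how many samples fall into each bin of assigned probability. The function name, the number of bins and the toy assignments are assumptions for illustration. The ground-truth domain labels are used here only for the analysis, never for training.

```python
import numpy as np

def assignment_histograms(w, true_domain, bins=10):
    """Histogram of latent-domain probabilities per ground-truth domain.

    w: (n, k) predicted latent-domain probabilities; true_domain: (n,) integer labels.
    Returns a dict mapping each true domain to a (k, bins) array of counts.
    """
    edges = np.linspace(0.0, 1.0, bins + 1)
    hists = {}
    for d in np.unique(true_domain):
        w_d = w[true_domain == d]
        hists[int(d)] = np.stack(
            [np.histogram(w_d[:, j], bins=edges)[0] for j in range(w.shape[1])]
        )
    return hists

# Toy assignments: domain 0 is always sent to latent domain 0, domain 1 to latent domain 1.
w = np.array([[1.0, 0.0]] * 5 + [[0.0, 1.0]] * 3)
true_domain = np.array([0] * 5 + [1] * 3)
hists = assignment_histograms(w, true_domain)
```

In this toy case all the mass of each true domain lies in the highest-probability bin of one latent domain, which corresponds to the behaviour described for Sketch.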

Fig. 2. PACS dataset: analysis of the assignments of source samples to each latent domain. Each row is a different latent domain and each color a different source domain: red for Photo, blue for Art, green for Cartoon and yellow for Sketch. (Color figure online)

To further confirm this analysis, Fig. 3 reports the top images assigned to each of the latent domains. The figure also highlights how appearance plays a crucial role in the domain discovery process, since the dominant color of an image strongly influences its domain assignment. This can be an important aspect for exploring future applications in the real world, where the shift might be caused by changes in e.g. illumination and weather conditions.

Fig. 3. PACS dataset: analysis of the top-5 images assigned to each of the latent domains for each source-target scenario. Each row is a different latent domain.

Comparison with the State-of-the-Art. Finally, we compare the performance of our model against state-of-the-art approaches on the Office-31 dataset, using as input \(\mathtt {fc7}\) features of the AlexNet architecture. We compare with the deep approach of [27] and with the shallow ones [11, 13, 36], which are among the few approaches tackling the latent domain discovery problem. The results are shown in Table 2. Our model outperforms both shallow [11, 13, 36] and deep [27] methods. Our algorithm obtains a gain of almost 1% on average with respect to the baseline [27], confirming the effectiveness of the proposed training objective for the domain classification branch and the fact that our algorithm performs better than [27] when the difference among the source domains is less marked.

Table 2. Office-31: comparison with state-of-the-art algorithms. In the first row we indicate the source (top) and the target domains (bottom).

5 Conclusions

In this work we have presented an algorithm for addressing the problem of latent domain DA, where the source domain is a mixture of multiple datasets and we do not know the domain membership of each sample. Our method is based on [27], where the latent DA task is solved by employing domain-specific alignment layers. These layers perform a normalization weighted by the probability that a sample belongs to a given domain, with the probability predicted by a domain classifier. While in [27] an entropy loss is employed to train the domain prediction branch, here we propose to use the Min-Entropy Consensus (MEC) loss [30] on perturbed versions of the features that we provide to the domain classifier for a single sample. Due to the consistency term, this loss is more stable than the standard entropy loss and regularizes the domain separation process. Results on the PACS and Office-31 datasets show that our model outperforms all the baselines on Office-31, while achieving similar or higher performance on PACS with respect to [27]. In future work we plan to expand these findings by exploring the impact of various perturbation and consensus strategies.