Abstract
We propose to extend the marginalized denoising autoencoder (MDA) framework with a domain regularization whose aim is to denoise both the source and target data in such a way that the features become domain invariant and the adaptation gets easier. The domain regularization, based either on the maximum mean discrepancy (MMD) measure or on domain prediction, aims to reduce the distance between the source and the target data. We also exploit the source class labels as another way to regularize the loss, by using a classification regularizer. We show that in these cases the noise marginalization reduces to solving either the linear matrix system \(\mathbf{A}\mathbf{X}=\mathbf{B}\), for which there exists a closed-form solution, or a Sylvester linear matrix equation \(\mathbf{A}\mathbf{X}+\mathbf{X}\mathbf{B}=\mathbf{C}\) that can be solved efficiently using the Bartels-Stewart algorithm. We conduct an extensive study of how these regularization terms improve the baseline performance, present experiments on three image benchmark datasets conventionally used for domain adaptation methods, and report our findings and comparisons with state-of-the-art methods.
Keywords
- Unsupervised domain adaptation
- Marginalized Denoising Autoencoder
- Sylvester equation
- Domain regularization
1 Introduction
Domain adaptation problems arise whenever we need to leverage labeled data in one or more related source domains to learn a classifier for unseen or unlabeled data in a target domain. The domains are assumed to be related, but not identical. The underlying domain shift occurs in multiple real-world applications. Numerous approaches have been proposed in recent years to address textual and visual domain adaptation (we refer the reader to [23, 32, 36] for recent surveys on transfer learning and domain adaptation methods). For text data, the domain shift is frequent in named entity recognition, statistical machine translation, opinion mining, part-of-speech tagging and document ranking [3, 11, 33, 41]. Domain adaptation has equally received a lot of attention in computer vision [1, 13–15, 17, 20–22, 29, 34, 35], where the domain shift is a consequence of changing conditions, such as background, location or pose, or of considering different image types, such as photos, paintings and sketches [4, 9, 25].
In this paper, we build on an approach to domain adaptation based on noise marginalization [5]. In deep learning, a denoising autoencoder (DA) learns a robust feature representation from training examples. In the case of domain adaptation, it takes the unlabeled instances of both source and target data and learns a new feature representation by reconstructing the original features from their noised counterparts. A marginalized denoising autoencoder (MDA) is a technique to marginalize the noise at training time; it avoids the explicit data corruption and does not require an optimization procedure for learning the model parameters but computes the model in a closed form. This makes MDAs scalable and computationally faster than the regular denoising autoencoders. The principle of noise marginalization has been successfully extended to learning with corrupted features [30], link prediction and multi-label learning [6], relational learning [7], collaborative filtering [26] and heterogeneous cross-domain learning [27, 40].
By marginalized domain adaptation we mean denoising the source and target instances in a way that explicitly makes their features domain invariant. To achieve this goal, we extend the MDA with a domain regularization term and explore three versions of this regularization. The first uses the maximum mean discrepancy (MMD) measure [24]. The second is inspired by the adversarial learning of deep neural networks [19]. The third is based on preserving accurate classification of the denoised source instances. In all cases the regularization term belongs to the class of squared loss functions. This guarantees the noise marginalization and the computational efficiency, either as a closed-form solution or as a solution of the Sylvester linear matrix equation \(\mathbf{A}\mathbf{X}+\mathbf{X}\mathbf{B}=\mathbf{C}\).
2 Feature Denoising for Domain Adaptation
Let \(\mathbf{X}^s=[\mathbf{X}_1,\ldots ,\mathbf{X}_{n_S}]\) denote a set of \(n_S\) source domains, with the corresponding labels \(\mathbf{Y}^s=[\mathbf{Y}_1,\ldots ,\mathbf{Y}_{n_S}]\), and let \(\mathbf{X}^t\) denote the unlabeled target domain data. The Marginalized Denoising Autoencoder (MDA) approach [5] reconstructs the input data from partial random corruption [39] with a marginalization that yields the optimal reconstruction weights \(\mathbf{W}\) in closed form. The MDA minimizes the loss
\(\mathcal {L}(\mathbf{W},\mathbf{X})=\frac{1}{2NK}\sum _{k=1}^{K}\Vert \mathbf{X}-\tilde{\mathbf{X}}_k\mathbf{W}\Vert ^2 + \omega \Vert \mathbf{W}\Vert ^2,\)
where \(\tilde{\mathbf{X}}_k \in \mathrm{I\!R}^{N \times d}\) is the k-th corrupted version of \(\mathbf{X}=[\mathbf{X}^s, \mathbf{X}^t]\), obtained by random feature dropout with probability p, \(\mathbf{W}\in \mathrm{I\!R}^{d \times d}\), and \(\omega \Vert \mathbf W \Vert ^2\) is a regularization term. To avoid the explicit feature corruption and an iterative optimization, Chen et al. [5] showed that in the limiting case \(K\rightarrow \infty \) the weak law of large numbers allows \(\mathcal {L}(\mathbf{W},\mathbf{X})\) to be replaced by its expectation. The optimal solution is then \(\mathbf{W}=(\mathbf{Q}+\omega \mathbf{I}_d)^{-1} \mathbf{P}\), where \(\mathbf{P}=\mathbb {E}[\mathbf{X}^\top \tilde{\mathbf{X}}]\) and \(\mathbf{Q}=\mathbb {E}[\tilde{\mathbf{X}}^\top \tilde{\mathbf{X}}]\) depend only on the covariance matrix \(\mathbf{S}=\mathbf{X}^{\top } \mathbf{X}\) of the uncorrupted data and the noise level p:
\(\mathbf{P}=(1-p)\,\mathbf{S}, \qquad \mathbf{Q}_{ij}=\begin{cases}(1-p)^2\,\mathbf{S}_{ij}, & i\ne j,\\ (1-p)\,\mathbf{S}_{ii}, & i=j.\end{cases}\)
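For illustration, this closed-form computation can be sketched in a few lines of NumPy, assuming uniform feature dropout with probability p; the function name and default values below are illustrative, not part of the paper.

```python
import numpy as np

def mda_weights(X, p=0.1, omega=0.01):
    """Minimal sketch of the marginalized denoising autoencoder (MDA) solution
    W = (Q + omega I)^{-1} P, assuming uniform dropout with probability p,
    so that E[x_tilde] = (1 - p) x. X stacks source and target rows."""
    d = X.shape[1]
    S = X.T @ X                           # scatter matrix of the uncorrupted data
    q = 1.0 - p                           # probability that a feature survives
    P = q * S                             # E[X^T X_tilde]
    Q = (q ** 2) * S                      # E[X_tilde^T X_tilde], off-diagonal entries
    np.fill_diagonal(Q, q * np.diag(S))   # diagonal entries keep a single factor q
    return np.linalg.solve(Q + omega * np.eye(d), P)
```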
2.1 Domain Regularization
To better address domain adaptation, we extend the feature denoising with a domain regularization that favors the learning of domain invariant features. We explore three versions of the domain regularization. We combine each of them with the reconstruction loss above and show how to marginalize the noise in each case while keeping \(\mathbf{W}\) the solution of a linear matrix equation. The three versions of the domain regularization are as follows:
Regularization \(\mathcal {R}_{m}\) Based on the Maximum Mean Discrepancy (MMD) with the Linear Kernel. It aims at reducing the gap between the denoised domain means. The MMD has already been used for domain adaptation with feature transformation learning [2, 31] and as a regularizer for cross-domain classifier learning [13, 28, 38]. In contrast to these papers, where the distributions are approximated with the MMD using multiple nonlinear kernels, we use the MMD with the linear kernel (see Note 1), the only one that allows us to keep the solution for \(\mathbf{W}\) in closed form.
The regularization term for K corrupted versions of \(\mathbf{X}\) is given by
\(\mathcal {R}_{m}=\frac{1}{K}\sum _{k=1}^{K} Tr\big (\mathbf{W}^\top \tilde{\mathbf{X}}_k^\top \mathbf{N}\,\tilde{\mathbf{X}}_k\mathbf{W}\big ), \quad \text{where}\quad \mathbf{N}=\begin{bmatrix} \frac{1}{N_s^2}\,\mathbf{1}^{s,s} & -\frac{1}{N_s N_t}\,\mathbf{1}^{s,t} \\ -\frac{1}{N_s N_t}\,\mathbf{1}^{t,s} & \frac{1}{N_t^2}\,\mathbf{1}^{t,t} \end{bmatrix},\)
\(\mathbf{1}^{a,b}\) is a constant matrix of size \(N_a\times N_b\) with all elements equal to 1, and \(N_s, N_t\) are the numbers of source and target examples. After the noise marginalization we obtain \(\mathbb {E}[\mathcal {R}_{m}] =Tr(\mathbf{W}^\top \mathbf{M}\mathbf{W})\), where \(\mathbf{M}=\mathbb {E}[\tilde{\mathbf{X}}^\top \mathbf{N}\tilde{\mathbf{X}}]\) is computed in the same way as \(\mathbf{Q}\) above, using \(\mathbf{S}_m=\mathbf{X}^\top \mathbf{N}\mathbf{X}\) instead of the correlation matrix \(\mathbf{S}\).
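As a sketch of how the linear-kernel MMD term can be assembled in practice, the block matrix \(\mathbf{N}\) and the scatter \(\mathbf{S}_m\) may be formed as below; the function name is ours and the block structure follows the definition given above.

```python
import numpy as np

def mmd_scatter(Xs, Xt):
    """Sketch of the linear-kernel MMD matrix N and of S_m = X^T N X, which
    replaces S when computing M = E[X_tilde^T N X_tilde]."""
    Ns, Nt = Xs.shape[0], Xt.shape[0]
    X = np.vstack([Xs, Xt])
    N = np.zeros((Ns + Nt, Ns + Nt))
    N[:Ns, :Ns] = 1.0 / Ns ** 2           # source-source block  (1/Ns^2) 1^{s,s}
    N[Ns:, Ns:] = 1.0 / Nt ** 2           # target-target block  (1/Nt^2) 1^{t,t}
    N[:Ns, Ns:] = -1.0 / (Ns * Nt)        # cross blocks  -(1/(Ns Nt)) 1^{s,t}
    N[Ns:, :Ns] = -1.0 / (Ns * Nt)
    S_m = X.T @ N @ X                     # enters the marginalization as S does for Q
    return N, S_m
```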
Regularization \(\mathcal {R}_d\) Based on Domain Prediction. It explicitly pushes the denoised source examples toward the target instances. The domain regularizer \(\mathcal {R}_{d}\), proposed in [8], is inspired by [18], where intermediate layers of a deep learning model are regularized using a domain prediction task. The main idea is to learn the denoising while pushing the source towards the target (or vice versa), hence allowing the source classifier to perform better on the target. The regularization term \(\mathcal {R}_d\) can be written as
\(\mathcal {R}_{d}=\frac{1}{K}\sum _{k=1}^{K}\Vert \mathbf{Y}_\mathcal{T}-\tilde{\mathbf{X}}_k\mathbf{W}\mathbf{Z}_\mathcal{D}\Vert ^2,\)
where \(\mathbf{Z}_\mathcal{D}\in \mathrm{I\!R}^d\) is a domain classifier trained on the uncorrupted data to distinguish the target from the source, and \(\mathbf{Y}_\mathcal{T}=\mathbf{1}^N\) is a vector containing only ones, as all denoised instances should look like the target (see Note 2). After the noise marginalization, the partial derivative of the expectation of this term with respect to \(\mathbf{W}\) is
\(\frac{\partial \mathbb {E}[\mathcal {R}_{d}]}{\partial \mathbf{W}}=2\big (\mathbf{Q}\,\mathbf{W}\mathbf{Z}_\mathcal{D}\mathbf{Z}_\mathcal{D}^\top -(1-p)\,\mathbf{X}^\top \mathbf{Y}_\mathcal{T}\mathbf{Z}_\mathcal{D}^\top \big ).\)
Classification Regularization \(\mathcal {R}_l\). It encourages the denoised source data to remain well classified by a classifier pre-trained on the source data. The regularizer \(\mathcal {R}_{l}\) is similar to \(\mathcal {R}_{d}\), except that \(\mathbf{Z}_l\) is trained on the uncorrupted source \(\mathbf{X}^s\) and acts only on the labeled source data. Also, instead of \(\mathbf{Y}_\mathcal{T}\), the ground-truth source labels \(\mathbf{Y}_l=\mathbf{Y}^s\) are used (see Note 3). In the marginalized version of \(\mathcal {R}_l\), the partial derivative with respect to \(\mathbf{W}\) can be written as
\(\frac{\partial \mathbb {E}[\mathcal {R}_{l}]}{\partial \mathbf{W}}=2\big (\mathbf{Q}_l\,\mathbf{W}\mathbf{Z}_l\mathbf{Z}_l^\top -(1-p)\,\mathbf{X}_l^\top \mathbf{Y}_l\mathbf{Z}_l^\top \big ),\)
where \(\mathbf{X}_l=\mathbf{X}^s\) and \(\mathbf{Q}_l\) is computed in the same way as \(\mathbf{Q}\), using \(\mathbf{S}_l=\mathbf{X}_l^{\top } \mathbf{X}_l\).
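Both auxiliary predictors \(\mathbf{Z}_\mathcal{D}\) and \(\mathbf{Z}_l\) can be pre-trained in closed form on the uncorrupted data, for instance with the same ridge formulation used for the source classifier in Sect. 3; the sketch below shows one possible way to do so and is an assumption on our part (the parameter delta is illustrative).

```python
import numpy as np

def ridge_predictor(X, Y, delta=0.01):
    """Closed-form ridge solution Z = (X^T X + delta I)^{-1} X^T Y, a sketch of
    how Z_D (domain labels in {-1, +1}) or Z_l (class labels) could be
    pre-trained on uncorrupted data before marginalizing the noise."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + delta * np.eye(d), X.T @ Y)

# Hypothetical usage:
#   Z_D = ridge_predictor(np.vstack([Xs, Xt]), np.r_[-np.ones(Ns), np.ones(Nt)])
#   Z_l = ridge_predictor(Xs, Ys)   # Ys with entries in {-1, +1} per class column
```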
2.2 Minimizing the Regularized Loss
We extend the noise marginalization framework to the regularized reconstruction loss and minimize the expected loss \(\mathbb {E}[\mathcal {L}+ \gamma _{\phi } \mathcal {R}_{\phi }]\), denoted \({{\mathrm{\mathbb {E}}}}[\mathcal {L}_{\phi }]\), where \(\phi \in \{m, d, l\}\) indicates the version of the regularization term \(\mathcal {R}_{\phi }\). From the marginalized terms presented in the previous sections, it is easy to show that when minimizing these regularized losses, the optimal \(\mathbf{W}\) given by \(\partial {{\mathrm{\mathbb {E}}}}[\mathcal {L}_{\phi }]/\partial \mathbf{W}=\mathbf{0}\) is obtained either by solving the linear matrix system \(\mathbf{A}\mathbf{W}=\mathbf{B}\), for which there exists a closed-form solution, or by solving a Sylvester linear matrix equation \(\mathbf{A}\mathbf{W}+\mathbf{W}\mathbf{B}=\mathbf{C}\), which can be done efficiently with the Bartels-Stewart algorithm. Due to the limited space, we report all the details in the full version and summarize the baseline, the three extensions and the corresponding solutions in Table 1.
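For the Sylvester case, the Bartels-Stewart algorithm is available in SciPy, so solving for \(\mathbf{W}\) amounts to a single library call once \(\mathbf{A}\), \(\mathbf{B}\) and \(\mathbf{C}\) have been assembled; the matrices below are random placeholders, not the exact terms summarized in Table 1.

```python
import numpy as np
from scipy.linalg import solve_sylvester

d = 200                                   # feature dimensionality after PCA (Sect. 3)
rng = np.random.default_rng(0)

# Placeholder matrices standing in for the marginalized terms of Table 1.
A = rng.standard_normal((d, d)); A = A @ A.T + d * np.eye(d)
B = rng.standard_normal((d, d)); B = B @ B.T + np.eye(d)
C = rng.standard_normal((d, d))

W = solve_sylvester(A, B, C)              # solves A W + W B = C (Bartels-Stewart)
assert np.allclose(A @ W + W @ B, C)
```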
As with stacked MDAs, we can stack several layers with only forward learning, where the denoised features of the previous layer serve as the input to the next layer, and nonlinear functions such as the hyperbolic tangent or rectified linear units can be applied between the layers.
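A possible sketch of this forward-only stacking, reusing a per-layer solver such as the mda_weights sketch above (or one of its regularized variants); the function name is illustrative.

```python
import numpy as np

def stack_layers(X, n_layers, solve_layer):
    """Each layer denoises the nonlinearly mapped output of the previous one;
    all layer outputs are concatenated with the original features."""
    feats, H = [X], X
    for _ in range(n_layers):
        W = solve_layer(H)                # e.g. mda_weights or a regularized variant
        H = np.tanh(H @ W)                # hyperbolic tangent between layers
        feats.append(H)
    return np.hstack(feats)
```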
3 Experimental Results
Datasets. We run experiments on the popular OFF31 [34] and OC10 [22] datasets, both with the full training protocol [21], where all source data is used for training, and with the sampling protocol [22, 34]. We evaluate our models with both the provided SURFBOV and the DECAF6 [12] features. In addition, we run experiments with the full training protocol on the Testbed Cross-Dataset [37] (TB), using both the provided SIFTBOV and the DECAF7 features.
Parameter Setting. To compare different models we run all experiments with the same preprocessing and parameter values (see Note 4). Features are L2 normalized and the feature dimensionality is reduced to 200 with PCA (BOV features are in addition power normalized). The parameter values are \(\omega =0.01\), \(\gamma _{\phi }=1\) and \(p=0.1\). Between layers we apply hyperbolic tangent nonlinearities and we concatenate the outputs of all layers with the original features (as in [5]).
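The preprocessing can be sketched as follows; the power normalization is assumed to be the usual signed square-root, and the exact ordering of the steps is our choice.

```python
import numpy as np
from sklearn.decomposition import PCA

def preprocess(X, power_norm=False, dim=200):
    """Sketch of the preprocessing: optional power normalization (for BOV
    features), L2 normalization, and PCA reduction to 200 dimensions."""
    if power_norm:
        X = np.sign(X) * np.sqrt(np.abs(X))
    X = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    return PCA(n_components=dim).fit_transform(X)
```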
We evaluate how the optimal denoising matrix \(\mathbf{W}\) influences three different classification methods: a regularized multi-class ridge classifier trained on the source (\(\mathbf{Z}= (\mathbf{X}_l^\top \mathbf{X}_l + \delta \mathbf{I}_d)^{-1} \mathbf{X}_l^\top \mathbf{Y}_l\)), the nearest neighbor classifier (NN) and the Domain Specific Class Means (DSCM) classifier [10], where a target test example is assigned to a class based on a soft-max distance to the domain-specific class means. The last two classifiers are selected for their non-linearity. Moreover, NN is related to retrieval and DSCM to clustering, so the impact of \(\mathbf{W}\) on these two extra tasks is assessed indirectly.
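For completeness, a rough sketch of the DSCM decision rule: a test example is scored against the domain-specific class means with soft-max-style weights. The bandwidth sigma and the exact weighting below are our assumptions, not the formulation of [10].

```python
import numpy as np

def dscm_predict(x, class_means, sigma=1.0):
    """class_means[c] is a list of domain-specific mean vectors for class c;
    the example x is assigned to the class with the largest accumulated
    soft-max weight of squared distances to its means."""
    scores = {}
    for c, means in class_means.items():
        d2 = np.array([np.sum((x - m) ** 2) for m in means])
        scores[c] = np.sum(np.exp(-d2 / (2.0 * sigma ** 2)))
    return max(scores, key=scores.get)
```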
Table 2 shows the domain adaptation results with a single source and Table 3 shows the multi-source results, both under the full training protocol. For each dataset, we consider all possible source-target pairs as domain adaptation tasks. Hence we average over 9 tasks on OFF31 (3 domains: A, D, W), and over 12 tasks each on OC10 (4 domains: A, C, D, W) and TB (4 domains: B, C, I, S).
Table 2 shows the results on L2 normalized DECAF features. It compares the domain regularization extensions to the baselines (BL) obtained with the L2 normalized features (full) and with the PCA reduced features, as well as with the MDA. As the table shows, the best results are often obtained with MRl, except on OC10, where MRd performs better. On the other hand, the \(\mathcal {R}_m\) regularizer (MRm) does not improve the M1 performance. Stacking several layers can further improve the results. When comparing these results to the literature, we can see that on OC10 we perform comparably to DAM [14] (84 %) and DDC [38] (84.6 %) but worse than more complex methods such as JDA [29] (87.5 %), TTM [16] (87.5 %) or DAN [28] (87.3 %). On OFF31, the deep adaptation method DAN [28] (72.9 %) significantly outperforms our results. On the TB dataset, in order to compare our results on DECAF6 to CORAL+SVM [35] (40.2 %), we average over the six source-target pairs that exclude domain B and obtain 43.6 % with MRd+DSCM and 43.1 % with MRl+DSCM. We also outperform (see Note 5) CORAL+SVM [35] (64 %) with our MRd+Ridge (65.2 %) when using the sampling protocol on OFF31.
Concerning the BOV features, the best results (using 3 layers) with the full training protocol are obtained with MRl+NN on OFF31 (29.7 %) and with MRd+Ridge on OC10 (48.2 %). The latter is comparable to CORAL+SVM [35] (48.8 %), but below LSSA [1] (52.3 %), which first selects landmarks before learning the transformation. The landmark selection is complementary to our approach and could boost our results as well.
In Table 3, we report the averaged results for the multi-source cases, obtained with BOV features under the full training protocol. For each dataset, all configurations with at least 2 source domains are considered, which yields 6 such configurations for OFF31 and 16 each for OC10 and TB. The results clearly indicate that taking the domain regularization into account improves the performance.
4 Conclusion
In this paper we extended the marginalized denoising autoencoder (MDA) framework with a domain regularization to enforce domain invariance. We studied three versions of the regularization, based on the maximum mean discrepancy measure, on domain prediction, and on the class predictions on the source. We showed that in all these cases the noise marginalization reduces to a closed-form solution or to a Sylvester linear matrix equation, for which efficient and scalable solutions exist. This furthermore allows several layers to be stacked at low cost. We studied the effect of these domain regularizations and ran single-source and multi-source experiments on three benchmark datasets, showing that adding the new regularization terms allows us to outperform the baselines. Compared to the state of the art, our method performs better than classical feature transformation methods but is outperformed by more complex deep domain adaptation methods. Compared to the latter, the main advantage of the proposed approach, beyond its low computational cost, is that since we learn an unsupervised feature transformation, we can also boost the performance of other tasks, such as retrieval or clustering, in the target space.
Notes
- 1.
Minimizing the distance between the corresponding domain centroids.
- 2.
In the multi-source case, \(\mathbf{Z}_\mathcal{D}\in \mathrm{I\!R}^{d \times ({n_S+1})}\), with the columns corresponding to the \(n_S\) source and 1 target domain classifiers, and \(\mathbf{Y}_\mathcal{T}\in \mathrm{I\!R}^{N \times ({n_S+1})}\), with \(y_{ns}=1\) if \(s=n_S+1\) and \(-1\) otherwise, where N is the total number of instances (source and target).
- 3.
\(\mathbf{Y}_l\in \mathrm{I\!R}^{N_s \times C}\), where \(y_{nc}=1\) if \(\mathbf{x}_n\) belongs to the class c and -1 otherwise. In the multi source case, we concatenate \(n_S\) multi-class \(\mathbf{Z}^a_l\) linear classifiers and the corresponding \(\mathbf{Y}^a_l\) label matrices, where \(\mathbf{Z}^a_l\) was trained on the source \(\mathcal{D}^{s_a}\).
- 4.
Cross-validation on the source was helpful only for some of the configurations; for others it decreased the performance.
- 5.
Their best results (68.5 % and 69.4 %) obtained with fine-tuned features are not directly comparable as our results can also be boosted when using these fine-tuned features.
References
Aljundi, R., Emonet, R., Muselet, D., Sebban, M.: Landmarks-based kernelized subspace alignment for unsupervised domain adaptation. In: Proceedings of CVPR, pp. 56–63. IEEE (2015)
Baktashmotlagh, M., Harandi, M., Lovell, B., Salzmann, M.: Unsupervised domain adaptation by domain invariant projection. In: Proceedings of ICCV, pp. 769–776. IEEE (2013)
Blitzer, J., Kakade, S., Foster, D.P.: Domain adaptation with coupled subspaces. In: Proceedings of AISTATS, pp. 173–181 (2011)
Castrejón, L., Aytar, Y., Vondrick, C., Pirsiavash, H., Torralba, A.: Learning aligned cross-modal representations from weakly aligned data. In: Proceedings of CVPR, IEEE (2016)
Chen, M., Xu, Z., Weinberger, K.Q., Sha, F.: Marginalized denoising autoencoders for domain adaptation. In: Proceedings of ICML, pp. 767–774 (2012)
Chen, Z., Chen, M., Weinberger, K.Q., Zhang, W.: Marginalized denoising for link prediction and multi-label learning. In: Proceedings of AAAI (2015)
Chen, Z., Zhang, W.: A marginalized denoising method for link prediction in relational data. In: Proceedings of ICDM (2014)
Clinchant, S., Csurka, G., Chidlovskii, B.: A domain adaptation regularization for denoising autoencoders. In: Proceedings of ACL (2016)
Crowley, E.J., Zisserman, A.: In search of art. In: Agapito, L., Bronstein, M.M., Rother, C. (eds.) ECCV 2014. LNCS, vol. 8927, pp. 54–70. Springer, Heidelberg (2015). doi:10.1007/978-3-319-16178-5_4
Csurka, G., Chidlovskii, B., Perronnin, F.: Domain adaptation with a domain specific class means classifier. In: Agapito, L., Bronstein, M.M., Rother, C. (eds.) ECCV 2014. LNCS, vol. 8927, pp. 32–46. Springer, Heidelberg (2015). doi:10.1007/978-3-319-16199-0_3
Daume, H., Marcu, D.: Domain adaptation for statistical classifiers. J. Artif. Intell. Res. 26(1), 101–126 (2006)
Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: Decaf: a deep convolutional activation feature for generic visual recognition. CoRR (2013). arXiv:1310.1531
Duan, L., Tsang, I.W., Xu, D.: Domain transfer multiple kernel learning. Trans. Pattern Recogn. Mach. Anal. (PAMI) 34(3), 465–479 (2012)
Duan, L., Tsang, I.W., Xu, D., Chua, T.S.: Domain adaptation from multiple sources via auxiliary classifiers. In: Proceedings of ICML, pp. 289–296 (2009)
Farajidavar, N., deCampos, T., Kittler, J.: Adaptive transductive transfer machines. In: Proceedings of BMVC (2014)
Farajidavar, N., Campos, T., Kittler, J.: Transductive transfer machine. In: Cremers, D., Reid, I., Saito, H., Yang, M.-H. (eds.) ACCV 2014. LNCS, vol. 9005, pp. 623–639. Springer, Heidelberg (2015). doi:10.1007/978-3-319-16811-1_41
Fernando, B., Habrard, A., Sebban, M., Tuytelaars, T.: Unsupervised visual domain adaptation using subspace alignment. In: Proceedings of ICCV, pp. 2960–2967. IEEE (2013)
Ganin, Y., Lempitsky, V.: Unsupervised domain adaptation by backpropagation, CoRR (2014). arXiv:1409.7495
Ganin, Y., Lempitsky, V.: Unsupervised domain adaptation by backpropagation. In: Proceedings of ICML, pp. 1180–1189 (2015)
Glorot, X., Bordes, A., Bengio, Y.: Domain adaptation for large-scale sentiment classification: a deep learning approach. In: Proceedings of ICML, pp. 513–520 (2011)
Gong, B., Grauman, K., Sha, F.: Connecting the dots with landmarks: Discriminatively learning domain invariant features for unsupervised domain adaptation. In: Proceedings of ICML, pp. 222–230 (2013)
Gong, B., Shi, Y., Sha, F., Grauman, K.: Geodesic flow kernel for unsupervised domain adaptation. In: Proceedings of CVPR, pp. 2066–2073. IEEE (2012)
Gopalan, R., Li, R., Patel, V.M., Chellappa, R.: Domain adaptation for visual recognition. Found. Trends Comput. Graph. Vis. 8(4), 285–378 (2015)
Huang, J., Smola, A., Gretton, A., Borgwardt, K., Schölkopf, B.: Correcting sample selection bias by unlabeled data. In: Proceedings of NIPS, (Curran Associates) (2007)
Klare, B.F., Bucak, S.S., Jain, A.K., Akgul, T.: Towards automated caricature recognition. In: Proceedings of ICB (2012)
Li, S., Kawale, J., Fu, Y.: Deep collaborative filtering via marginalized denoising auto-encode. In: Proceedings of CIKM, pp. 811–820. ACM (2015)
Li, Y., Yang, M., Xu, Z., Zhang, Z.: Learning with marginalized corrupted features and labels together. In: Proceedings of AAAI (2016). arXiv:1602.07332
Long, M., Cao, Y., Wang, J., Jordan, M.I.: Learning transferable features with deep adaptation networks. In: Proceedings of ICML (2015)
Long, M., Wang, J., Ding, G., Sun, J., Yu, P.S.: Transfer feature learning with joint distribution adaptation. In: Proceedings of ICCV, pp. 2200–2207. IEEE (2013)
Maaten, L.V.D., Chen, M., Tyree, S., Weinberger, K.: Learning with marginalized corrupted features. In: Proceedings of ICML (2013)
Pan, S.J., Tsang, I.W., Kwok, J.T., Yang, Q.: Domain adaptation via transfer component analysis. Trans. Neural Netw. 22(2), 199–210 (2011)
Pan, S.J., Yang, Q.: A survey on transfer learning. Trans. Knowl. Data Eng. 22(10), 1345–1359 (2010)
Pan, S.J., Ni, X., Sun, J.T., Yang, Q., Chen, Z.: Cross-domain sentiment classification via spectral feature alignment. In: Proceedings of WWW (2010)
Saenko, K., Kulis, B., Fritz, M., Darrell, T.: Adapting visual category models to new domains. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6316, pp. 213–226. Springer, Heidelberg (2010). doi:10.1007/978-3-642-15561-1_16
Sun, B., Feng, J., Saenko, K.: Return of frustratingly easy domain adaptation. In: Proceedings of AAAI (2016)
Sun, S.S., Shi, H., Wu, Y.: A survey of multi-source domain adaptation. Inf. Fusion 24, 84–92 (2015)
Tommasi, T., Tuytelaars, T.: A testbed for cross-dataset analysis. In: Agapito, L., Bronstein, M.M., Rother, C. (eds.) ECCV 2014. LNCS, vol. 8927, pp. 18–31. Springer, Heidelberg (2015). doi:10.1007/978-3-319-16199-0_2
Tzeng, E., Hoffman, J., Zhang, N., Saenko, K., Darrell, T.: Deep domain confusion: Maximizing for domain invariance. CoRR (2014). arXiv:1412.3474
Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.A.: Extracting and composing robust features with denoising autoencoders. In: Proceedings of ICML (2008)
Zhou, J.T., Pan, S.J., Tsang, I.W., Yan, Y.: Hybrid heterogeneous transfer learning through deep learning. In: Proceedings of AAAI (2014)
Zhou, M., Chang, K.C.: Unifying learning to rank and domain adaptation: enabling cross-task document scoring. In: Proceedings of SIGKDD (ACM), pp. 781–790 (2014)