Abstract
Domain adaptation aims to correct the mismatch in statistical properties between the source domain on which a classifier is trained and the target domain to which the classifier is to be applied. In this paper, we address the challenging scenario of unsupervised domain adaptation, where the target domain does not provide any annotated data to assist in adapting the classifier. Our strategy is to learn robust features which are resilient to the mismatch across domains and then use them to construct classifiers that will perform well on the target domain. To this end, we propose novel kernel learning approaches to infer such features for adaptation. Concretely, we explore two closely related directions. In the first direction, we propose unsupervised learning of a geodesic flow kernel (GFK). The GFK summarizes the inner products in an infinite sequence of feature subspaces that smoothly interpolates between the source and target domains. In the second direction, we propose supervised learning of a kernel that discriminatively combines multiple base GFKs. These base kernels model the source and the target domains at fine-grained granularities. In particular, each base kernel pivots on a different set of landmarks—the most useful data instances that reveal the similarity between the source and the target domains, thus bridging them to achieve adaptation. Our approaches are computationally convenient, automatically infer important hyper-parameters, and are capable of learning features and classifiers discriminatively without demanding labeled data from the target domain. In extensive empirical studies on standard benchmark recognition datasets, our approaches yield state-of-the-art results compared to a variety of competing methods.








Notes
Note that we assume the set of possible labels are the same across domains.
The unit-ball condition allows the difference to be represented as a metric in the form of Eq. (13), and the universality ensures that the mean map is injective, so that the difference in the means is zero if and only if the two distributions are the same. For more detailed theoretical analysis, please refer to Gretton et al. (2006).
Note that we do not require the landmarks to be i.i.d samples from \(P_S(X)\)—they only need to be representative samples of \(P_L(X)\).
In the supplementary material for our previously published work (Gong et al. 2012), we report our results on 31 categories common to Amazon, Webcam and DSLR, to compare directly to published results from the literature (Saenko et al. 2010; Kulis et al. 2011; Gopalan et al. 2011). Despite occasional discrepancies between the published results and the results obtained by our own experimentation on these 31 categories, they demonstrate the same trend—that our proposed methods significantly outperform competing approaches.
We did not use DSLR as the source domain in these experiments, as it is too small to select landmarks from.
References
Ando, R., & Zhang, T. (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. JMLR, 6, 1817–1853.
Bay, H., Tuytelaars, T., & Van Gool, L. (2006). SURF: Speeded up robust features. In ECCV.
Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., & Wortman, J. (2010). A theory of learning from different domains. Machine Learning, 79, 151–175.
Ben-David, S., Blitzer, J., Crammer, K., & Pereira, F. (2007). Analysis of representations for domain adaptation. In NIPS.
Bergamo, A., & Torresani, L. (2010). Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach. In NIPS.
Blitzer, J., Dredze, M., & Pereira, F. (2007). Biographies, Bollywood, boomboxes and blenders: Domain adaptation for sentiment classification. In ACL.
Blitzer, J., Foster, D., & Kakade, S. (2011). Domain adaptation with coupled subspaces. In AISTATS.
Blitzer, J., McDonald, R., & Pereira, F. (2006). Domain adaptation with structural correspondence learning. In EMNLP.
Bruzzone, L., & Marconcini, M. (2010). Domain adaptation problems: A DASVM classification technique and a circular validation strategy. IEEE PAMI, 32(5), 770–787.
Chen, M., Weinberger, K., & Blitzer, J. (2011). Co-training for domain adaptation. In NIPS.
Daumé, H., III. (2007). Frustratingly easy domain adaptation. In ACL.
Daumé, H., Kumar, A., & Saha, A. (2010). Co-regularization based semi-supervised domain adaptation. In NIPS.
Daumé, H., III, & Marcu, D. (2006). Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26(1), 101–126.
Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In CVPR.
Dollár, P., Wojek, C., Schiele, B., & Perona, P. (2009). Pedestrian detection: A benchmark. In CVPR.
Dredze, M., & Crammer, K. (2008). Online methods for multi-domain learning and adaptation. In Proceedings of the conference on empirical methods in natural language processing (EMNLP ’08) (pp. 689–697).
Duan, L., Tsang, I., Xu, D., & Maybank, S. (2009). Domain transfer SVM for video concept detection. In CVPR.
Duan, L., Xu, D., & Tsang, I. (2012). Domain adaptation from multiple sources: A domain-dependent regularization approach. IEEE Transactions on Neural Networks and Learning Systems, 23(3), 504–518.
Duan, L., Xu, D., Tsang, I., & Luo, J. (2010). Visual event recognition in videos by learning from web data. In CVPR.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2007). The PASCAL visual object classes challenge 2007.
Fei-Fei, L., Fergus, R., & Perona, P. (2007). Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Computer Vision and Image Understanding, 106(1), 59–70.
Gong, B., Grauman, K., & Sha, F. (2013a). Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation. In ICML.
Gong, B., Grauman, K., & Sha, F. (2013b). Reshaping visual datasets for domain adaptation. In NIPS.
Gong, B., Shi, Y., Sha, F., & Grauman, K. (2012). Geodesic flow kernel for unsupervised domain adaptation. In CVPR.
Gopalan, R. (2013). Learning cross-domain information transfer for location recognition and clustering. In CVPR.
Gopalan, R., Li, R., & Chellappa, R. (2011). Domain adaptation for object recognition: An unsupervised approach. In ICCV.
Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A. (2006). A kernel method for the two-sample-problem. In NIPS.
Gretton, A., Smola, A., Huang, J., Schmittfull, M., Borgwardt, K., & Scholkopf, B. (2009). Covariate shift by kernel mean matching. In J. Quionero-Candela, M. Sugiyama, A. Schwaighofer, & N. Lawrence (Eds.), Dataset shift in machine learning. Cambridge: MIT Press.
Griffin, G., Holub, A., & Perona, P. (2007). Caltech-256 object category dataset. Tech. rep., Caltech.
Ham, J., Lee, D. D., Mika, S., Schölkopf, B. (2004). A kernel view of the dimensionality reduction of manifolds. In ICML.
Hamm, J., & Lee, D. (2008). Grassmann discriminant analysis: A unifying view on subspace-based learning. In ICML.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning. Berlin: Springer.
Huang, J., Smola, A., Gretton, A., Borgwardt, K., & Scholkopf, B. (2006). Correcting sample selection bias by unlabeled data. In NIPS.
Jain, V., & Learned-Miller, E. (2011). Online domain adaptation of a pre-trained cascade of classifiers. In CVPR.
Kulis, B., Saenko, K., & Darrell, T. (2011). What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In CVPR.
Lanckriet, G. R. G., Cristianini, N., Bartlett, P., Ghaoui, L. E., & Jordan, M. (2004). Learning the kernel matrix with semidefinite programming. JMLR, 5, 27–72.
Leggetter, C., & Woodland, P. (1995). Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech and Language, 9(2), 171–185.
Li, R., & Zickler, T. (2012). Discriminative virtual views for cross-view action recognition. In CVPR.
Mansour, Y., Mohri, M., & Rostamizadeh, A. (2009a). Domain adaptation: Learning bounds and algorithms. arXiv preprint arXiv:0902.3430.
Mansour, Y., Mohri, M., & Rostamizadeh, A. (2009b). Multiple source adaptation and the rényi divergence. In UAI.
Pan, S., Tsang, I., Kwok, J., & Yang, Q. (2009). Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 99, 1–12.
Pan, S. J., & Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345–1359.
Perronnin, F., Sánchez, J., & Liu, Y. (2010). Large-scale image categorization with explicit data embedding. In CVPR.
Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2323–2326.
Russell, B. C., Torralba, A., Murphy, K. P., & Freeman, W. T. (2008). LabelMe: A database and web-based tool for image annotation. IJCV, 77, 157–173.
Saenko, K., Kulis, B., Fritz, M., & Darrell, T. (2010). Adapting visual category models to new domains. In ECCV.
Shi, Y., & Sha, F. (2012). Information-theoretical learning of discriminative clusters for unsupervised domain adaptation. In ICML.
Shimodaira, H. (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2), 227–244.
Torralba, A., & Efros, A. (2011). Unbiased look at dataset bias. In CVPR.
Vedaldi, A., Gulshan, V., Varma, M., & Zisserman, A. (2009). Multiple kernels for object detection. In ICCV.
Wang, M., & Wang, X. (2011). Automatic adaptation of a generic pedestrian detector to a specific traffic scene. In CVPR.
Weinberger, K. Q., & Saul, L. K. (2006). Unsupervised learning of image manifolds by semidefinite programming. International Journal of Computer Vision, 70(1), 77–90.
Zheng, J., Liu, M. Y., Chellappa, R., & Phillips, P. J. (2012). A Grassmann manifold-based domain adaptation approach. In ICPR.
Acknowledgments
This work is partially supported by DARPA D11-AP00278 and NSF IIS-1065243 (B.G. and F.S.), and ONR ATL #N00014-11-1-0105 (K.G.). We thank the anonymous reviewers for their constructive comments and suggestions. The Flickr images in Fig. 1 are under a CC Attribution 2.0 Generic license, courtesy of berzowska, IvanWalsh.com, warrantedarrest, HerryLawford, yuichi.sakuraba, zimaowuyu, GrahamAndDairne, Bernt Rostad, Keith Roper, flavorrelish, and deflam.
Communicated by Hal Daumé.
Appendices
1.1 Appendix 1: Derivation of Geodesic Flow Kernel (GFK)
Let \(\varvec{\varOmega }^{\mathrm{T}}\) denote the following matrix
The geodesic flow \(\varvec{\varPhi }(t),\; t\in (0,1)\), between \(\varvec{P}_\mathcal {S}\) and \(\varvec{P}_\mathcal {T}\) can be written as
Recall that the geodesic flow kernel (GFK) is defined as,
where
Substituting the expression for \(\varvec{\varPhi }(t)\) in Eq. (22) into the above, we have (ignoring \(\varvec{\varOmega }\) for the moment),
Both \(\varvec{\varGamma }(t)\) and \(\varvec{\varSigma }(t)\) are diagonal matrices whose elements are \(\cos (t\theta _i)\) and \(\sin (t\theta _i)\), respectively. Thus, we can integrate in closed form,
which become the \(i\)th diagonal elements of the diagonal matrices \(\varvec{\varLambda }_1,\; \varvec{\varLambda }_2\), and \(\varvec{\varLambda }_3\), respectively. In terms of these matrices, the inner product in Eq. (23) is a linear kernel \(\varvec{x}_i^{\mathrm{T}}\varvec{G}\varvec{x}_j\), with the matrix \(\varvec{G}\) given by
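The derivation above can also be sketched numerically. The function below is a minimal illustration (the name `gfk`, the quadrature resolution, and the numerical tolerance are our own choices, not from the paper): instead of evaluating the closed form, it approximates \(\varvec{G}=\int _0^1 \varvec{\varPhi }(t)\varvec{\varPhi }(t)^{\mathrm{T}}\,dt\) by midpoint quadrature, parameterizing the geodesic through the principal angles.

```python
import numpy as np

def gfk(Ps, Pt, n_steps=200):
    """Approximate the geodesic flow kernel G = int_0^1 Phi(t) Phi(t)^T dt.

    Ps, Pt: (D, d) orthonormal bases of the source/target subspaces.
    Returns the (D, D) matrix G such that k(x_i, x_j) = x_i^T G x_j.
    """
    U1, gamma, Vt = np.linalg.svd(Ps.T @ Pt)
    gamma = np.clip(gamma, -1.0, 1.0)         # guard against round-off
    theta = np.arccos(gamma)                  # principal angles
    A = Ps @ U1                               # rotated source basis
    H = Pt @ Vt.T - A * gamma                 # geodesic direction; H^T H = diag(sin^2 theta)
    sin_t = np.sin(theta)
    W = np.divide(H, sin_t, out=np.zeros_like(H), where=sin_t > 1e-12)
    ts = (np.arange(n_steps) + 0.5) / n_steps  # midpoint rule on (0, 1)
    G = np.zeros((Ps.shape[0], Ps.shape[0]))
    for t in ts:
        Phi = A * np.cos(t * theta) + W * np.sin(t * theta)
        G += Phi @ Phi.T
    return G / n_steps
```

When the source and target subspaces coincide, all principal angles vanish and the quadrature reduces to the linear kernel \(\varvec{P}_\mathcal {S}\varvec{P}_\mathcal {S}^{\mathrm{T}}\) on the shared subspace.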
1.2 Appendix 2: Derivation of the Rank of Domain (ROD) Metric
1.2.1 Principal Angles and Vectors
Let \(\varvec{P}_\mathcal {S}\) and \(\varvec{P}_\mathcal {T}\) be the bases of two subspaces. The principal angles \(\theta _i\) between the two subspaces are recursively defined as,
such that
In the above, \(\varvec{s}_i\) and \(\varvec{t}_i\) are called the principal vectors associated with \(\theta _i\). Essentially, the principal vectors are new bases for the two subspaces such that, after the change of bases, the two subspaces maximally overlap. The degree of overlap is measured by the principal angles, that is, the smallest angles between the bases.
Given the singular value decomposition,
both the principal angles and vectors can be computed efficiently
where \(\gamma _{i}\) is the \(i\)th diagonal element of the diagonal matrix \(\varvec{\varGamma }\), and \((\varvec{M})_{\cdot ,i}\) denotes the \(i\)th column of the matrix \(\varvec{M}\).
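The SVD recipe above translates directly into a few lines of NumPy. The following minimal sketch (the function name is our own; `Ps` and `Pt` stand for the orthonormal bases \(\varvec{P}_\mathcal {S}\) and \(\varvec{P}_\mathcal {T}\)) returns the principal angles together with the principal vectors:

```python
import numpy as np

def principal_angles(Ps, Pt):
    """Principal angles and vectors between span(Ps) and span(Pt).

    Ps, Pt: (D, d) matrices with orthonormal columns.
    Returns (theta, S, T): angles in ascending order and the associated
    principal vectors as the columns of S and T.
    """
    U, gamma, Vt = np.linalg.svd(Ps.T @ Pt)
    gamma = np.clip(gamma, -1.0, 1.0)   # guard against round-off outside [-1, 1]
    theta = np.arccos(gamma)            # gamma_i = cos(theta_i)
    S = Ps @ U                          # principal vectors of the first subspace
    T = Pt @ Vt.T                       # principal vectors of the second subspace
    return theta, S, T
```

By construction \(\varvec{S}^{\mathrm{T}}\varvec{T} = \varvec{\varGamma }\), i.e., each pair of principal vectors satisfies \(\varvec{s}_i^{\mathrm{T}}\varvec{t}_i = \cos \theta _i\).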
1.2.2 Computing ROD
Let \(\varvec{X}_\mathcal {S}\in \mathbb {R}^{\mathsf {N}_S \times \mathsf {D}}\) and \(\varvec{X}_\mathcal {T}\in \mathbb {R}^{\mathsf {N}_T \times \mathsf {D}}\) denote the data from the source and the target domains. We use their PCA subspaces to compute the ROD metric. The optimal dimensionality \(\mathsf {d}^*\) of the subspaces is selected with our subspace disagreement measure, described in Sect. 2.4 in the main text.
The ROD metric integrates both geometrical and statistical information between two domains by
where \(\mathcal {S}_i\) and \(\mathcal {T}_i\) are the one-dimensional distributions of \(\varvec{X}_\mathcal {S}^{\mathrm{T}}\varvec{s}_i\) and \(\varvec{X}_\mathcal {T}^{\mathrm{T}}\varvec{t}_i\), respectively. In other words, we project the data onto the principal vectors and compare how (dis)similarly the data are distributed across the two domains.
We approximate these two distributions with one-dimensional Gaussians. Note that \(\varvec{X}_\mathcal {S}\) and \(\varvec{X}_\mathcal {T}\) have zero means. We thus need only compute the variances in order to specify the Gaussians. These variances can be readily computed from the projections and the covariance matrices of the original data:
In terms of the approximating Gaussians, the ROD metric is computed in closed form
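The computation can be sketched end to end in NumPy. Since the display equation for the metric itself was given above, the combination used below, principal angles weighted by the symmetrized KL divergence between the zero-mean one-dimensional Gaussian fits, should be read as an illustrative sketch whose exact weighting and normalization may differ from the published form:

```python
import numpy as np

def rod(Xs, Xt, d):
    """Sketch of the Rank of Domain (ROD) metric between two datasets.

    Xs: (Ns, D) source data; Xt: (Nt, D) target data; d: subspace dimension.
    Combines the principal angles between the d-dimensional PCA subspaces with
    the symmetrized KL divergence between 1-D Gaussian fits of the projections.
    """
    Xs = Xs - Xs.mean(axis=0)                   # the data are assumed zero-mean
    Xt = Xt - Xt.mean(axis=0)
    Ps = np.linalg.svd(Xs, full_matrices=False)[2][:d].T   # (D, d) PCA basis
    Pt = np.linalg.svd(Xt, full_matrices=False)[2][:d].T
    U, gamma, Vt = np.linalg.svd(Ps.T @ Pt)
    theta = np.arccos(np.clip(gamma, -1.0, 1.0))   # principal angles
    S, T = Ps @ U, Pt @ Vt.T                       # principal vectors
    v_s = (Xs @ S).var(axis=0) + 1e-12             # variances of 1-D projections
    v_t = (Xt @ T).var(axis=0) + 1e-12
    # closed-form symmetrized KL between N(0, v_s) and N(0, v_t)
    skl = 0.5 * (v_s / v_t + v_t / v_s) - 1.0
    return float(np.mean(theta * skl))
```

The metric is zero when the two domains coincide (all angles vanish and the projected variances match) and grows with both geometric and statistical discrepancy.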
1.3 Appendix 3: Proof of Theorem 1
We first prove the following lemma.
Lemma 1
Under the conditions of Theorem 1, the following inequality holds,
Proof
We start with
We now use the fact that the \(\log \) function is concave to arrive at
where
Substituting Eq. (38) into Eq. (37), we have
Applying the condition of Theorem 1 to the right-hand side of the inequality, we have
where \(A = \max \left\{ KL(P_N\Vert P_L),\, KL(P_L\Vert P_N)\right\} \).
Note that
as the maximum of \(2\alpha (1-\alpha )+\alpha \) is \(9/8\), attained at \(\alpha = 3/4\). This leads to
To complete the proof of the lemma, note that, due to the convexity of the KL-divergence, we have
Combining the last two inequalities together, we complete the proof of the lemma.\(\square \)
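For completeness, the elementary maximization invoked in the lemma, \(\max _\alpha \, 2\alpha (1-\alpha )+\alpha \), can be verified directly; the function is concave, so the critical point is the maximizer:

```latex
f(\alpha) = 2\alpha(1-\alpha) + \alpha = 3\alpha - 2\alpha^2, \qquad
f'(\alpha) = 3 - 4\alpha = 0 \;\Longrightarrow\; \alpha^\ast = \tfrac{3}{4}, \qquad
f\bigl(\tfrac{3}{4}\bigr) = \tfrac{9}{4} - \tfrac{9}{8} = \tfrac{9}{8}.
```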
Proof of the Theorem. We start by applying the convexity of the KL-divergence again,
where we have applied Lemma 1 in the penultimate inequality. The last inequality is the desired result of the theorem.
Gong, B., Grauman, K. & Sha, F. Learning Kernels for Unsupervised Domain Adaptation with Applications to Visual Object Recognition. Int J Comput Vis 109, 3–27 (2014). https://doi.org/10.1007/s11263-014-0718-4