Abstract
The Domain Adaptation problem in machine learning occurs when the distribution generating the test data differs from the one that generates the training data. A common approach to this issue is to train a standard learner for the learning task on the available training sample, even though that sample is generated by a distribution different from the test distribution. One can view such learning as learning from a not-perfectly-representative training sample. The question we focus on is under which circumstances large training samples of this kind can guarantee that the learned classifier performs just as well as one learned from target-generated samples. In other words, are there circumstances in which quantity can compensate for quality (of the training data)? We give a positive answer, showing that this is possible when using a Nearest Neighbor algorithm. We show this under two assumptions about the relationship between the training and the target data distributions: covariate shift, and a bound on the ratio of certain probability weights between the source (training) and target (test) distributions. We further show that in a slightly different learning model, in which one imposes restrictions on the nature of the learned classifier, these assumptions are not always sufficient to allow such a replacement of the training sample: for proper learning, where the output classifier has to come from a predefined class, we prove that any learner needs access to data generated from the target distribution.
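The setting described in the abstract can be illustrated with a small simulation. The following is only a minimal sketch under assumptions that are not taken from the article: the two distributions, the labeling rule, and all function names (`label`, `sample_source`, `sample_target`, `nn_predict`, `target_error`) are hypothetical choices made for illustration. It trains a 1-Nearest Neighbor classifier on source-generated data only and measures its error on target-generated data, in a toy case where the labeling rule is shared (covariate shift) and the source weight of any region is at least a quarter of its target weight.

```python
import numpy as np


def label(x):
    # Hypothetical deterministic labeling rule, shared by both distributions
    # (this sharing is what the covariate shift assumption asks for).
    return (x[:, 0] + x[:, 1] > 1.5).astype(int)


def sample_source(n, rng):
    # Source marginal (assumed): uniform on the unit square [0, 1]^2.
    return rng.uniform(0.0, 1.0, size=(n, 2))


def sample_target(n, rng):
    # Target marginal (assumed): uniform on [0.5, 1]^2, so every region
    # has source weight at least 1/4 of its target weight.
    return rng.uniform(0.5, 1.0, size=(n, 2))


def nn_predict(train_x, train_y, query_x):
    # 1-Nearest Neighbor: copy the label of the closest training point
    # (squared Euclidean distance, computed by broadcasting).
    diffs = query_x[:, None, :] - train_x[None, :, :]
    sq_dist = (diffs ** 2).sum(axis=2)
    return train_y[sq_dist.argmin(axis=1)]


def target_error(n_source, rng, n_target=1000):
    # Train on source data only, then estimate the error on target data.
    train_x = sample_source(n_source, rng)
    train_y = label(train_x)
    test_x = sample_target(n_target, rng)
    mistakes = nn_predict(train_x, train_y, test_x) != label(test_x)
    return float(np.mean(mistakes))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for n in (20, 200, 2000):
        print(n, target_error(n, rng))
```

In this toy setup the estimated target error should tend to shrink as the source sample grows, which is the "quantity compensating for quality" effect in miniature. It says nothing about the rates or the proper-learning lower bound established in the article.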
Cite this article
Ben-David, S., Urner, R. Domain adaptation–can quantity compensate for quality? Ann Math Artif Intell 70, 185–202 (2014). https://doi.org/10.1007/s10472-013-9371-9