Domain adaptation–can quantity compensate for quality?


Abstract

The Domain Adaptation problem in machine learning occurs when the distribution generating the test data differs from the one generating the training data. A common approach to this issue is to train a standard learner for the learning task on the available training sample, even though that sample is generated by a distribution that differs from the test distribution. One can view such learning as learning from a not-perfectly-representative training sample. The question we focus on is under which circumstances large sizes of such training samples can guarantee that the learned classifier performs just as well as one learned from target-generated samples. In other words, are there circumstances in which quantity can compensate for quality (of the training data)? We give a positive answer, showing that this is possible when using a Nearest Neighbor algorithm. We show this under some assumptions about the relationship between the training and the target data distributions: the covariate shift assumption, together with a bound on the ratio of certain probability weights between the source (training) and target (test) distributions. We further show that in a slightly different learning model, where one imposes restrictions on the nature of the learned classifier, these assumptions are not always sufficient to allow such a replacement of the training sample: for proper learning, where the output classifier has to come from a predefined class, we prove that any learner needs access to data generated from the target distribution.
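To make the positive result concrete, here is a minimal sketch (in Python with NumPy; the language, the specific distributions, and the toy labeling function are our illustrative assumptions, not the paper's construction) of the setting: a 1-Nearest-Neighbor classifier is trained on samples from a source distribution and evaluated on a differently distributed target. Covariate shift is modeled by having source and target share one deterministic labeling function, and the target's support is kept inside the source's as a crude stand-in for the bounded weight-ratio assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def label(x):
    # Deterministic labeling shared by source and target (covariate shift).
    return (x[:, 0] + x[:, 1] > 1.0).astype(int)

def sample_source(n):
    # Source (training) marginal: uniform on the unit square.
    return rng.uniform(0.0, 1.0, size=(n, 2))

def sample_target(n):
    # Target (test) marginal: mass concentrated near the decision boundary.
    # Clipping keeps the target's support inside the source's support.
    return np.clip(rng.normal(0.5, 0.15, size=(n, 2)), 0.0, 1.0)

def nn_predict(train_x, train_y, test_x):
    # 1-Nearest Neighbor: each test point copies the label of its
    # closest training point (Euclidean distance).
    dists = np.linalg.norm(test_x[:, None, :] - train_x[None, :, :], axis=2)
    return train_y[np.argmin(dists, axis=1)]

test_x = sample_target(1000)
test_y = label(test_x)
for n in (100, 1000, 10000):
    train_x = sample_source(n)
    preds = nn_predict(train_x, label(train_x), test_x)
    print(f"source sample size {n:6d} -> target error {np.mean(preds != test_y):.3f}")
```

With these toy choices, the printed target error shrinks as the source sample grows, even though no target-generated training data is ever used: quantity compensating for quality.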



Author information


Corresponding author

Correspondence to Ruth Urner.


Cite this article

Ben-David, S., Urner, R.: Domain adaptation – can quantity compensate for quality? Ann. Math. Artif. Intell. 70, 185–202 (2014). https://doi.org/10.1007/s10472-013-9371-9

