DOI: 10.1145/1150402.1150422

Reverse testing: an efficient framework to select amongst classifiers under sample selection bias

Published: 20 August 2006

ABSTRACT

One of the most important assumptions made by many classification algorithms is that the training and test sets are drawn from the same distribution, i.e., the so-called "stationary distribution assumption" that the future and the past data sets are identical from a probabilistic standpoint. In many real-world application domains, such as marketing solicitation, fraud detection, drug testing, loan approval, sub-population surveys, and school enrollment, this is rarely the case, because the only labeled sample available for training is biased in different ways due to a variety of practical reasons and limitations. In these circumstances, traditional methods for evaluating the expected generalization error of classification algorithms, such as structural risk minimization, ten-fold cross-validation, and leave-one-out validation, usually return poor estimates of which classification algorithm, when trained on a biased dataset, will be the most accurate on a future unbiased dataset, among a number of competing candidates. Sometimes, the estimated ordering of the learning algorithms' accuracy can be so poor that it is no better than random guessing. Therefore, a method to determine the most accurate learner is needed for data mining under sample selection bias in many real-world applications. We present such an approach that can determine which learner will perform best on an unbiased test set, given a possibly biased training set, at a fraction of the computational cost of cross-validation-based approaches.
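The abstract's core claim, that error estimates computed from a biased sample can reverse the true ranking of competing classifiers, can be illustrated with a small simulation. This sketch is not from the paper; the distributions, thresholds, and names below are invented for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

def label(x):
    # True concept: positive iff x > 0.5
    return x > 0.5

def clf_a(x):
    # Candidate A: errs on the region (0.3, 0.5], width 0.2
    return x > 0.3

def clf_b(x):
    # Candidate B: errs on the region (0.5, 0.6], width 0.1
    return x > 0.6

def error(clf, x):
    # Empirical error of a classifier on a sample
    return np.mean(clf(x) != label(x))

n = 200_000
# Unbiased sample: x uniform on [0, 1] -- the true test distribution.
x_unbiased = rng.uniform(0, 1, n)
# Biased sample: density 5x^4 (via inverse-CDF sampling), which
# over-represents large x -- a simple form of sample selection bias.
x_biased = rng.uniform(0, 1, n) ** 0.2

# On the unbiased distribution, B is clearly better (error ~0.10 vs ~0.20),
# but any estimate computed on the biased sample ranks A above B,
# because A's error region carries almost no probability mass there.
unbiased_ranking_prefers_b = error(clf_b, x_unbiased) < error(clf_a, x_unbiased)
biased_ranking_prefers_a = error(clf_a, x_biased) < error(clf_b, x_biased)
```

Any model-selection procedure (such as cross-validation) that only ever sees the biased sample inherits this reversed ranking, which is the failure mode the paper's reverse-testing framework is designed to avoid.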


Published in

KDD '06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2006, 986 pages
ISBN: 1595933395
DOI: 10.1145/1150402

Copyright © 2006 ACM. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher: Association for Computing Machinery, New York, NY, United States


Overall Acceptance Rate (KDD): 1,133 of 8,635 submissions, 13%
